Posted to common-user@hadoop.apache.org by Mathias De Maré <ma...@gmail.com> on 2009/08/05 09:02:59 UTC
Some tasks fail to report status between the end of the map and the
beginning of the merge
Hi,
I'm having some problems (Hadoop 0.20.0) where map tasks fail to report
status for 10 minutes and eventually get killed. All of the tasks output
around the same amount of data; some take only a few seconds before starting
the 'merge' on the segments, but others seem to fail by simply stopping
work for about 10 minutes.
Several of these failed tasks do eventually succeed, on their 2nd or 3rd
attempt. It's nice to see them succeed in the end, but each of those
tasks crawls a few thousand websites, and it seems like a terrible waste to
let them retry a few times, re-downloading all of those websites after
10 minutes of doing nothing.
Even more annoyingly, eventually one of the tasks fails completely, which
then kills the entire job.
I could probably increase the number of task attempts and simply hope the
tasks will eventually succeed, but that doesn't solve the huge slowdowns and
the recrawling required.
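The usual workaround for long stretches of work between status updates is to report progress periodically from inside the map loop. A minimal sketch of the pattern, with a local Reporter interface standing in for org.apache.hadoop.mapred.Reporter (whose progress() call is what resets the task's timeout clock):

```java
// Sketch of periodic progress reporting for a long-running map loop, as in
// the crawl described above. The nested Reporter interface is a stand-in for
// org.apache.hadoop.mapred.Reporter; in a real mapper you would call
// reporter.progress() (or increment a counter) instead.
public class ProgressSketch {

    interface Reporter { void progress(); }

    static final int REPORT_EVERY = 100; // records between keepalive pings

    // Processes 'records' items, pinging the framework every REPORT_EVERY
    // records; returns the number of progress calls made.
    static int crawlAll(int records, Reporter reporter) {
        int pings = 0;
        for (int i = 0; i < records; i++) {
            // ... fetch one website and emit its records here ...
            if (i % REPORT_EVERY == 0) {
                reporter.progress(); // tell the TaskTracker we are alive
                pings++;
            }
        }
        return pings;
    }

    public static void main(String[] args) {
        int pings = crawlAll(1000, new Reporter() {
            public void progress() { /* would notify the TaskTracker */ }
        });
        System.out.println(pings); // 10: one ping per 100 records
    }
}
```

In a real mapper the same call inside the per-record loop keeps the TaskTracker from declaring the task unresponsive, though it does not help once the map loop has finished and the sort/merge phase has taken over, which is the window being discussed in this thread.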
Here's an example of a successful task attempt (this is attempt 3 of a
specific task -- note that it takes around 8 minutes between spill 133 and
spill 134):
2009-08-04 18:38:48,059 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 129
2009-08-04 18:39:00,626 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2009-08-04 18:39:00,626 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 2492163; bufend = 628913; bufvoid = 2988446
2009-08-04 18:39:00,626 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 9727; kvend = 7760; length = 9830
2009-08-04 18:39:01,467 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 130
2009-08-04 18:39:08,136 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2009-08-04 18:39:08,136 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 628913; bufend = 1880222; bufvoid = 2988448
2009-08-04 18:39:08,136 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 7760; kvend = 5793; length = 9830
2009-08-04 18:39:08,463 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 131
2009-08-04 18:39:12,456 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2009-08-04 18:39:12,459 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 1880222; bufend = 136018; bufvoid = 2988448
2009-08-04 18:39:12,459 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 5793; kvend = 3826; length = 9830
2009-08-04 18:39:12,697 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 132
2009-08-04 18:39:23,138 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2009-08-04 18:39:23,138 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 136018; bufend = 1347353; bufvoid = 2988448
2009-08-04 18:39:23,138 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 3826; kvend = 1859; length = 9830
2009-08-04 18:39:25,747 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 133
2009-08-04 18:47:49,823 INFO org.apache.hadoop.mapred.MapTask:
Starting flush of map output
2009-08-04 18:47:50,132 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 134
2009-08-04 18:47:50,525 INFO org.apache.hadoop.mapred.Merger: Merging
135 sorted segments
2009-08-04 18:47:50,528 INFO org.apache.hadoop.mapred.Merger: Merging
9 intermediate segments out of a total of 135
2009-08-04 18:47:52,224 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 127
2009-08-04 18:47:53,837 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 118
2009-08-04 18:47:55,417 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 109
2009-08-04 18:47:56,990 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 100
2009-08-04 18:47:58,492 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 91
2009-08-04 18:48:00,191 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 82
2009-08-04 18:48:02,315 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 73
2009-08-04 18:48:04,184 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 64
2009-08-04 18:48:06,162 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 55
2009-08-04 18:48:08,149 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 46
2009-08-04 18:48:09,888 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 37
2009-08-04 18:48:11,744 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 28
2009-08-04 18:48:13,544 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 19
2009-08-04 18:48:21,177 INFO org.apache.hadoop.mapred.Merger: Down to
the last merge-pass, with 10 segments left of total size: 153426836
bytes
2009-08-04 18:48:46,226 INFO org.apache.hadoop.mapred.TaskRunner:
Task:attempt_200908041626_0001_m_000133_2 is done. And is in the
process of commiting
2009-08-04 18:48:46,373 INFO org.apache.hadoop.mapred.TaskRunner: Task
'attempt_200908041626_0001_m_000133_2' done.
And here's the same task, but an earlier attempt (it only goes up to spill
132, but the spills do vary slightly in size, right?):
2009-08-04 17:15:49,112 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 321905; bufend = 1464691; bufvoid = 2988448
2009-08-04 17:15:49,112 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 7764; kvend = 5797; length = 9830
2009-08-04 17:15:49,910 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 126
2009-08-04 17:16:00,785 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2009-08-04 17:16:00,785 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 1464691; bufend = 2580856; bufvoid = 2988448
2009-08-04 17:16:00,785 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 5797; kvend = 3830; length = 9830
2009-08-04 17:16:05,728 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 127
2009-08-04 17:16:20,883 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2009-08-04 17:16:20,883 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 2580856; bufend = 1000978; bufvoid = 2988448
2009-08-04 17:16:20,883 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 3830; kvend = 1863; length = 9830
2009-08-04 17:16:21,709 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 128
2009-08-04 17:16:24,461 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2009-08-04 17:16:24,461 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 1000978; bufend = 2246166; bufvoid = 2988448
2009-08-04 17:16:24,461 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 1863; kvend = 9727; length = 9830
2009-08-04 17:16:24,934 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 129
2009-08-04 17:16:40,853 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2009-08-04 17:16:40,853 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 2246166; bufend = 374750; bufvoid = 2988435
2009-08-04 17:16:40,853 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 9727; kvend = 7760; length = 9830
2009-08-04 17:16:43,740 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 130
2009-08-04 17:16:49,141 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2009-08-04 17:16:49,141 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 374750; bufend = 1653302; bufvoid = 2988448
2009-08-04 17:16:49,141 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 7760; kvend = 5793; length = 9830
2009-08-04 17:16:49,868 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 131
2009-08-04 17:17:04,902 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2009-08-04 17:17:04,902 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 1653302; bufend = 2882491; bufvoid = 2988448
2009-08-04 17:17:04,902 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 5793; kvend = 3826; length = 9830
2009-08-04 17:17:05,355 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 132
There's no more output after that last spill, since the task eventually got
killed after 10 minutes of inactivity.
And in case it can help, here's a thread dump for one of those failed tasks:
2009-08-04 18:53:23,460 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_200908041626_0001_m_000135_3: Task
attempt_200908041626_0001_m_000135_3 failed to report status for 602
seconds. Killing!
2009-08-04 18:53:23,470 INFO org.apache.hadoop.mapred.TaskTracker: Process
Thread Dump: lost task
31 active threads
Thread 1673 (process reaper):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
java.lang.UNIXProcess.waitForProcessExit(Native Method)
java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
Thread 1672 (JVM Runner jvm_200908041626_0001_m_1938204689 spawned.):
State: WAITING
Blocked count: 1
Waited count: 2
Waiting on java.lang.UNIXProcess@155d434
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
org.apache.hadoop.util.Shell.run(Shell.java:134)
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
Thread 1671 (Thread-1040):
State: WAITING
Blocked count: 0
Waited count: 1
Waiting on java.lang.Object@1ba4552
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
Thread 1646 (process reaper):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
java.lang.UNIXProcess.waitForProcessExit(Native Method)
java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
Thread 1645 (JVM Runner jvm_200908041626_0001_m_2100332869 spawned.):
State: WAITING
Blocked count: 1
Waited count: 2
Waiting on java.lang.UNIXProcess@5e34ce
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
org.apache.hadoop.util.Shell.run(Shell.java:134)
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
Thread 1644 (Thread-1023):
State: WAITING
Blocked count: 0
Waited count: 1
Waiting on java.lang.Object@1bee6f
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
Thread 143 (process reaper):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
java.lang.UNIXProcess.waitForProcessExit(Native Method)
java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
Thread 141 (JVM Runner jvm_200908041626_0001_r_-836397802 spawned.):
State: WAITING
Blocked count: 1
Waited count: 2
Waiting on java.lang.UNIXProcess@11fceed
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
org.apache.hadoop.util.Shell.run(Shell.java:134)
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
Thread 140 (Thread-89):
State: WAITING
Blocked count: 0
Waited count: 1
Waiting on java.lang.Object@1c63791
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
Thread 33 (IPC Client (47) connection to localhost/127.0.0.1:9001 from
root):
State: TIMED_WAITING
Blocked count: 5563
Waited count: 5552
Stack:
java.lang.Object.wait(Native Method)
org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:403)
org.apache.hadoop.ipc.Client$Connection.run(Client.java:445)
Thread 32 (Directory/File cleanup thread):
State: WAITING
Blocked count: 0
Waited count: 165
Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@fdffb1
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.mapred.CleanupQueue$PathCleanupThread.run(CleanupQueue.java:89)
Thread 8 (taskCleanup):
State: WAITING
Blocked count: 0
Waited count: 2
Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@19c205b
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:352)
java.lang.Thread.run(Thread.java:619)
Thread 31 (TaskLauncher for task):
State: WAITING
Blocked count: 4
Waited count: 5
Waiting on java.util.LinkedList@1d9d565
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1662)
Thread 30 (TaskLauncher for task):
State: WAITING
Blocked count: 783
Waited count: 781
Waiting on java.util.LinkedList@1fbebee
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1662)
Thread 29 (Map-events fetcher for all reduce tasks on
tracker_ip-10-244-145-237.ec2.internal:localhost/127.0.0.1:40135):
State: TIMED_WAITING
Blocked count: 10965
Waited count: 10957
Stack:
java.lang.Object.wait(Native Method)
org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread.run(TaskTracker.java:671)
Thread 27 (IPC Server handler 7 on 40135):
State: WAITING
Blocked count: 5
Waited count: 2565
Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@17da690
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 26 (IPC Server handler 6 on 40135):
State: WAITING
Blocked count: 11
Waited count: 2565
Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@17da690
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 25 (IPC Server handler 5 on 40135):
State: WAITING
Blocked count: 11
Waited count: 2566
Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@17da690
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 24 (IPC Server handler 4 on 40135):
State: WAITING
Blocked count: 7
Waited count: 2565
Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@17da690
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 23 (IPC Server handler 3 on 40135):
State: WAITING
Blocked count: 10
Waited count: 2565
Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@17da690
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 22 (IPC Server handler 2 on 40135):
State: WAITING
Blocked count: 11
Waited count: 2566
Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@17da690
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 21 (IPC Server handler 1 on 40135):
State: WAITING
Blocked count: 15
Waited count: 2567
Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@17da690
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 20 (IPC Server handler 0 on 40135):
State: WAITING
Blocked count: 12
Waited count: 2565
Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@17da690
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 17 (IPC Server listener on 40135):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
org.apache.hadoop.ipc.Server$Listener.run(Server.java:318)
Thread 19 (IPC Server Responder):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
org.apache.hadoop.ipc.Server$Responder.run(Server.java:478)
Thread 16 (Timer-0):
State: TIMED_WAITING
Blocked count: 1
Waited count: 298
Stack:
java.lang.Object.wait(Native Method)
java.util.TimerThread.mainLoop(Timer.java:509)
java.util.TimerThread.run(Timer.java:462)
Thread 15 (11502424@qtp0-0 - Acceptor0 SelectChannelConnector@0.0.0.0:50060
):
State: RUNNABLE
Blocked count: 50
Waited count: 1
Stack:
sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:429)
org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:185)
org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:707)
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
Thread 4 (Signal Dispatcher):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
Thread 3 (Finalizer):
State: WAITING
Blocked count: 424
Waited count: 425
Waiting on java.lang.ref.ReferenceQueue$Lock@7b6d1c
Stack:
java.lang.Object.wait(Native Method)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
Thread 2 (Reference Handler):
State: WAITING
Blocked count: 445
Waited count: 446
Waiting on java.lang.ref.Reference$Lock@16ee240
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
Thread 1 (main):
State: RUNNABLE
Blocked count: 2973
Waited count: 5789
Stack:
sun.management.ThreadImpl.getThreadInfo0(Native Method)
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:149)
org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:203)
org.apache.hadoop.mapred.TaskTracker.markUnresponsiveTasks(TaskTracker.java:1383)
org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1129)
org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1779)
org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2881)
2009-08-04 18:53:23,471 INFO org.apache.hadoop.mapred.TaskTracker: About to
purge task: attempt_200908041626_0001_m_000135_3
2009-08-04 18:53:23,471 INFO org.apache.hadoop.mapred.TaskTracker:
addFreeSlot : current free slots : 3
2009-08-04 18:53:23,471 INFO org.apache.hadoop.mapred.TaskRunner:
attempt_200908041626_0001_m_000135_3 done; removing files.
2009-08-04 18:53:23,550 INFO org.apache.hadoop.mapred.IndexCache: Map ID
attempt_200908041626_0001_m_000135_3 not found in cache
Thanks in advance for any response!
I'm afraid I'm somewhat clueless about this.
Mathias
Re: Some tasks fail to report status between the end of the map and
the beginning of the merge
Posted by Mathias De Maré <ma...@gmail.com>.
2009/8/12 Mathias De Maré <ma...@gmail.com>
> Thank you, that's very useful.
> In addition, I changed the way the tasks work, so they store their data in
> HBase now (since it's more suited for handling small files).
> I'm not 100% sure yet if the problems have been resolved (still doing
> extensive testing), but I think I might have gotten rid of them (and I'll
> add the 'skipping records' option in case I do get a failure).
>
Hi,
I can get everything to 'run' successfully now, but there are still some
tasks that crash.
I was thinking perhaps my Writable class is the issue, so I'll just post it
here. Does anyone notice anything that could cause a hang? In particular,
the readFields and write methods could be the cause (but I just don't
see it).
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

/**
 * Contains information on a URL: which other URLs link to it and whether it
 * has been crawled previously.
 * @author mathias
 */
public class URLInfo implements Writable, WritableComparable {

    String url;
    Text[] linkedfrom;
    int urlStatus;
    int seconds;

    public URLInfo() {
        url = "";
        linkedfrom = new Text[0];
        urlStatus = Constants.URL_NEW;
        seconds = 1;
    }

    /**
     * @param url
     * @param linkedfrom Must only contain domain names, nothing appended
     * @param urlStatus
     */
    public URLInfo(String url, Text[] linkedfrom, int urlStatus, int seconds) {
        this.url = url;
        this.linkedfrom = linkedfrom;
        this.urlStatus = urlStatus;
        this.seconds = seconds;
    }

    public void write(DataOutput out) throws IOException {
        new Text(url).write(out);
        new ArrayWritable(Text.class, linkedfrom).write(out);
        new IntWritable(urlStatus).write(out);
        new IntWritable(seconds).write(out);
    }

    public void readFields(DataInput in) throws IOException {
        url = Text.readString(in);
        ArrayWritable aw = new ArrayWritable(Text.class);
        aw.readFields(in);
        Writable[] linkedfromWritable = aw.get();
        linkedfrom = new Text[linkedfromWritable.length];
        for (int i = 0; i < linkedfromWritable.length; i++) {
            linkedfrom[i] = (Text) linkedfromWritable[i];
        }
        IntWritable iw = new IntWritable();
        iw.readFields(in);
        urlStatus = iw.get();
        IntWritable iw2 = new IntWritable();
        iw2.readFields(in);
        seconds = iw2.get();
    }

    public int compareTo(Object o) {
        return url.compareToIgnoreCase(((URLInfo) o).url);
    }

    public void setURLStatus(int urlStatus) {
        this.urlStatus = urlStatus;
    }

    public int getURLStatus() {
        return urlStatus;
    }

    public void setLinkedFrom(Text[] linkedfrom) {
        this.linkedfrom = linkedfrom;
    }

    public Text[] getLinkedFrom() {
        return linkedfrom;
    }

    public String getURL() {
        return new String(url);
    }

    public int getSeconds() {
        return seconds;
    }

    public void setSeconds(int seconds) {
        this.seconds = seconds;
    }

    @Override
    public String toString() {
        return new String(url);
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof URLInfo)) {
            return false;
        }
        URLInfo urlObject = (URLInfo) obj;
        return this.getURL().equals(urlObject.getURL());
    }
}
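One observation on the class above, offered as a possible contributor rather than a confirmed cause: compareTo() compares URLs case-insensitively while equals() compares them case-sensitively, and hashCode() is not overridden, so sorting, equality, and hash partitioning can disagree for keys differing only in case. MapReduce key types generally need all three to agree. A self-contained sketch of a consistent trio (Key is a hypothetical stand-in, not the real URLInfo):

```java
// A key class whose ordering, equality, and hashing all use the same basis.
public class KeyConsistencySketch {

    static final class Key implements Comparable<Key> {
        final String url;

        Key(String url) { this.url = url; }

        @Override
        public int compareTo(Key o) {
            return url.compareTo(o.url);   // same basis for ordering...
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Key && url.equals(((Key) o).url); // ...equality...
        }

        @Override
        public int hashCode() {
            return url.hashCode();         // ...and hashing.
        }
    }

    public static void main(String[] args) {
        Key a = new Key("http://example.org");
        Key b = new Key("http://example.org");
        // All three views agree for equal keys:
        System.out.println(a.compareTo(b) == 0
                && a.equals(b)
                && a.hashCode() == b.hashCode()); // prints true
    }
}
```

If the case-insensitive ordering is intentional, the safer route is to normalize the URL (e.g. lower-case it) once when the object is created, so that compareTo, equals, and hashCode all see the same value.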
Re: Some tasks fail to report status between the end of the map and
the beginning of the merge
Posted by Mathias De Maré <ma...@gmail.com>.
Thank you, that's very useful.
In addition, I changed the way the tasks work, so they store their data in
HBase now (since it's more suited for handling small files).
I'm not 100% sure yet if the problems have been resolved (still doing
extensive testing), but I think I might have gotten rid of them (and I'll
add the 'skipping records' option in case I do get a failure).
Mathias
On Mon, Aug 10, 2009 at 5:46 PM, Koji Noguchi <kn...@yahoo-inc.com> wrote:
> > but I didn't find a config option
> > that allows ignoring tasks that fail.
> >
> If 0.18,
>
> http://hadoop.apache.org/common/docs/r0.18.3/api/org/apache/hadoop/mapred/JobConf.html#setMaxMapTaskFailuresPercent(int)
> (mapred.max.map.failures.percent)
>
> http://hadoop.apache.org/common/docs/r0.18.3/api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceTaskFailuresPercent(int)
> (mapred.max.reduce.failures.percent)
>
>
> If 0.19 or later, you can also try skipping records.
>
>
> Koji
>
Re: Some tasks fail to report status between the end of the map and
the beginning of the merge
Posted by Koji Noguchi <kn...@yahoo-inc.com>.
> but I didn't find a config option
> that allows ignoring tasks that fail.
>
If 0.18,
http://hadoop.apache.org/common/docs/r0.18.3/api/org/apache/hadoop/mapred/JobConf.html#setMaxMapTaskFailuresPercent(int)
(mapred.max.map.failures.percent)
http://hadoop.apache.org/common/docs/r0.18.3/api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceTaskFailuresPercent(int)
(mapred.max.reduce.failures.percent)
If 0.19 or later, you can also try skipping records.
Koji
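The two properties Koji points at can be supplied directly in the job's XML configuration; a sketch using the property names quoted above (the value 5 is an illustrative threshold, not a recommendation):

```xml
<!-- tolerate up to 5% failed map / reduce tasks before failing the whole job -->
<property>
  <name>mapred.max.map.failures.percent</name>
  <value>5</value>
</property>
<property>
  <name>mapred.max.reduce.failures.percent</name>
  <value>5</value>
</property>
```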
On 8/9/09 2:18 AM, "Mathias De Maré" <ma...@gmail.com> wrote:
> I changed the maximum split size to 30000, and now most tasks actually
> succeed.
> However, I still have the failure problem with some tasks (with a job I was
> running yesterday, I got a failure after 1900 tasks).
> The problem is that these very few failures can bring down the entire job,
> as they sometimes seem to just keep failing.
> I looked through the mapred-default.xml, but I didn't find a config option
> that allows ignoring tasks that fail. Is there a way to do this (it seems
> like the only alternative I have, since I can't make the failures stop)?
>
> Mathias
>
> 2009/8/5 Mathias De Maré <ma...@gmail.com>
>
>>
>> On Wed, Aug 5, 2009 at 9:38 AM, Jothi Padmanabhan
>> <jo...@yahoo-inc.com> wrote:
>>> Hi,
>>>
>>> Could you please try setting this parameter
>>> mapred.merge.recordsBeforeProgress to a lower number?
>>> See https://issues.apache.org/jira/browse/HADOOP-4714
>>>
>>> Cheers
>>> Jothi
>>
>>
>> Hm, that bug looks like it's applicable during the merge, but my case is a
>> block right before the merge (but seemingly right after all of the map tasks
>> finish).
>> I tried putting mapred.merge.recordsBeforeProgress to 100, and it didn't
>> make a difference.
>>
>> On Wed, Aug 5, 2009 at 10:32 AM, Amogh Vasekar <am...@yahoo-inc.com> wrote:
>>
>>> 10 mins reminds me of parameter mapred.task.timeout. This is
>>> configurable. Or alternatively you might just do a sysout to let tracker
>>> know of its existence ( not an ideal solution though )
>>>
>>> Thanks,
>>> Amogh
>>
>>
>> Well, the map tasks take around 30 minutes to run. Letting the task idle
>> for a large number of minutes after that is a lot of useless time, imho. I
>> tried with 20 minutes now, but I still get timeouts.
>>
>> I don't know if it's useful, but here are the settings of the map tasks at
>> the moment:
>>
>> <configuration>
>> <property>
>> <name>mapred.job.tracker</name>
>> <value>localhost:9001</value>
>> </property>
>> <property>
>> <name>io.sort.mb</name>
>> <value>3</value>
>> <description>The total amount of buffer memory to use while sorting
>> files, in megabytes. By default, gives each merge stream 1MB, which
>> should minimize seeks.</description>
>> </property>
>> <property>
>> <name>mapred.tasktracker.map.tasks.maximum</name>
>> <value>4</value>
>> <description>The maximum number of map tasks that will be run
>> simultaneously by a task tracker.
>> </description>
>> </property>
>>
>> <property>
>> <name>mapred.tasktracker.reduce.tasks.maximum</name>
>> <value>4</value>
>> <description>The maximum number of reduce tasks that will be run
>> simultaneously by a task tracker.
>> </description>
>> </property>
>>
>> <property>
>> <name>mapred.max.split.size</name>
>> <value>1000000</value>
>> </property>
>>
>> <property>
>> <name>mapred.child.java.opts</name>
>> <value>-Xmx400m</value>
>> </property>
>>
>> <property>
>> <name>mapred.merge.recordsBeforeProgress</name>
>> <value>100</value>
>> </property>
>>
>> <property>
>> <name>mapred.task.timeout</name>
>> <value>1200000</value>
>> </property>
>>
>> </configuration>
>>
>> Ideally, I would want to get rid of the delay that causes the timeouts, yet
>> also increase the split size somewhat (though I think a larger split size
>> would increase the delay even more?).
>> The map tasks take around 8000-11000 records as input, and can produce up
>> to 1 000 000 records as output (in case this is relevant).
>>
>> Mathias
>>
>>
Re: Some tasks fail to report status between the end of the map and
the beginning of the merge
Posted by Mathias De Maré <ma...@gmail.com>.
I changed the maximum split size to 30000, and now most tasks actually
succeed.
However, I still have the failure problem with some tasks (with a job I was
running yesterday, I got a failure after 1900 tasks).
The problem is that these very few failures can bring down the entire job,
as they sometimes seem to just keep failing.
I looked through mapred-default.xml, but I didn't find a config option that
allows ignoring tasks that fail. Is there a way to do this? It seems like
the only alternative I have, since I can't make the failures stop.
Mathias
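
For completeness, one knob that can keep a handful of stubborn map failures
from killing the whole job is mapred.max.map.failures.percent (a sketch
based on my reading of Hadoop 0.20's defaults, not something confirmed in
this thread). It lets the job succeed as long as no more than the given
percentage of map tasks fail after exhausting all of their attempts:

```xml
<property>
<name>mapred.max.map.failures.percent</name>
<!-- Tolerate up to 1% of map tasks failing outright without failing
     the whole job; the 0.20 default is 0 (any task failure kills the job). -->
<value>1</value>
</property>
```

If I read the 0.20 API correctly, the programmatic equivalent is
JobConf.setMaxMapTaskFailuresPercent(int).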
2009/8/5 Mathias De Maré <ma...@gmail.com>
>
> On Wed, Aug 5, 2009 at 9:38 AM, Jothi Padmanabhan <jo...@yahoo-inc.com>wrote:
>> Hi,
>>
>> Could you please try setting this parameter
>> mapred.merge.recordsBeforeProgress to a lower number?
>> See https://issues.apache.org/jira/browse/HADOOP-4714
>>
>> Cheers
>> Jothi
>
>
> Hm, that bug looks like it applies during the merge itself, but in my case
> the task blocks right before the merge (seemingly just after the map
> finishes).
> I tried setting mapred.merge.recordsBeforeProgress to 100, and it didn't
> make a difference.
>
> On Wed, Aug 5, 2009 at 10:32 AM, Amogh Vasekar <am...@yahoo-inc.com>wrote:
>
>> 10 mins reminds me of parameter mapred.task.timeout . This is
>> configurable. Or alternatively you might just do a sysout to let tracker
>> know of its existence ( not an ideal solution though )
>>
>> Thanks,
>> Amogh
>
>
> Well, the map tasks take around 30 minutes to run. Letting a task idle for
> another 10 minutes on top of that is a lot of wasted time, imho. I tried a
> 20-minute timeout now, but I still get timeouts.
>
> I don't know if it's useful, but here are the settings of the map tasks at
> the moment:
>
> <configuration>
> <property>
> <name>mapred.job.tracker</name>
> <value>localhost:9001</value>
> </property>
> <property>
> <name>io.sort.mb</name>
> <value>3</value>
> <description>The total amount of buffer memory to use while sorting
> files, in megabytes. By default, gives each merge stream 1MB, which
> should minimize seeks.</description>
> </property>
> <property>
> <name>mapred.tasktracker.map.tasks.maximum</name>
> <value>4</value>
> <description>The maximum number of map tasks that will be run
> simultaneously by a task tracker.
> </description>
> </property>
>
> <property>
> <name>mapred.tasktracker.reduce.tasks.maximum</name>
> <value>4</value>
> <description>The maximum number of reduce tasks that will be run
> simultaneously by a task tracker.
> </description>
> </property>
>
> <property>
> <name>mapred.max.split.size</name>
> <value>1000000</value>
> </property>
>
> <property>
> <name>mapred.child.java.opts</name>
> <value>-Xmx400m</value>
> </property>
>
> <property>
> <name>mapred.merge.recordsBeforeProgress</name>
> <value>100</value>
> </property>
>
> <property>
> <name>mapred.task.timeout</name>
> <value>1200000</value>
> </property>
>
> </configuration>
>
> Ideally, I would want to get rid of the delay that causes the timeouts, yet
> also increase the split size somewhat (though I think a larger split size
> would increase the delay even more?).
> The map tasks take around 8000-11000 records as input, and can produce up
> to 1,000,000 records as output (in case this is relevant).
>
> Mathias
>
>
Re: Some tasks fail to report status between the end of the map and
the beginning of the merge
Posted by Mathias De Maré <ma...@gmail.com>.
> On Wed, Aug 5, 2009 at 9:38 AM, Jothi Padmanabhan <jo...@yahoo-inc.com>wrote:
> Hi,
>
> Could you please try setting this parameter
> mapred.merge.recordsBeforeProgress to a lower number?
> See https://issues.apache.org/jira/browse/HADOOP-4714
>
> Cheers
> Jothi
Hm, that bug looks like it applies during the merge itself, but in my case
the task blocks right before the merge (seemingly just after the map
finishes).
I tried setting mapred.merge.recordsBeforeProgress to 100, and it didn't
make a difference.
On Wed, Aug 5, 2009 at 10:32 AM, Amogh Vasekar <am...@yahoo-inc.com> wrote:
> 10 mins reminds me of parameter mapred.task.timeout . This is configurable.
> Or alternatively you might just do a sysout to let tracker know of its
> existence ( not an ideal solution though )
>
> Thanks,
> Amogh
Well, the map tasks take around 30 minutes to run. Letting a task idle for
another 10 minutes on top of that is a lot of wasted time, imho. I tried a
20-minute timeout now, but I still get timeouts.
I don't know if it's useful, but here are the settings of the map tasks at
the moment:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<property>
<name>io.sort.mb</name>
<value>3</value>
<description>The total amount of buffer memory to use while sorting
files, in megabytes. By default, gives each merge stream 1MB, which
should minimize seeks.</description>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>4</value>
<description>The maximum number of map tasks that will be run
simultaneously by a task tracker.
</description>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>4</value>
<description>The maximum number of reduce tasks that will be run
simultaneously by a task tracker.
</description>
</property>
<property>
<name>mapred.max.split.size</name>
<value>1000000</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx400m</value>
</property>
<property>
<name>mapred.merge.recordsBeforeProgress</name>
<value>100</value>
</property>
<property>
<name>mapred.task.timeout</name>
<value>1200000</value>
</property>
</configuration>
Ideally, I would want to get rid of the delay that causes the timeouts, yet
also increase the split size somewhat (though I think a larger split size
would increase the delay even more?).
The map tasks take around 8000-11000 records as input, and can produce up to
1,000,000 records as output (in case this is relevant).
Mathias
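
As a sanity check on those numbers: with io.sort.mb=3 and what I believe are
0.20's defaults (io.sort.record.percent = 0.05, and 16 accounting bytes per
record in MapTask's kvoffsets/kvindices arrays -- my reading of the source,
so treat both as assumptions), the sort buffer only has room for about 9830
record slots, which matches the "length = 9830" lines in the spill logs:

```java
public class SortBufferMath {
    public static void main(String[] args) {
        long bufferBytes = 3L << 20;              // io.sort.mb = 3 (in bytes)
        double recordPercent = 0.05;              // io.sort.record.percent default
        long acctBytes = (long) (bufferBytes * recordPercent);
        long recordSlots = acctBytes / 16;        // 16 accounting bytes per record
        long dataBytes = bufferBytes - acctBytes; // space left for serialized key/values
        System.out.println("record slots: " + recordSlots); // 9830, cf. "length = 9830"
        System.out.println("data bytes:   " + dataBytes);   // 2988442, cf. bufvoid ~2988448
    }
}
```

With up to ~1,000,000 output records per task and only a few thousand
records fitting per spill, a hundred-plus spill files (and the 135-way merge
in the logs) are expected; a larger io.sort.mb should cut both the spill
count and the merge work.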
RE: Some tasks fail to report status between the end of the map and
the beginning of the merge
Posted by Amogh Vasekar <am...@yahoo-inc.com>.
10 mins reminds me of parameter mapred.task.timeout . This is configurable. Or alternatively you might just do a sysout to let tracker know of its existence ( not an ideal solution though )
Thanks,
Amogh
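
That manual-reporting idea can be made a bit more structured than a bare
sysout: a daemon thread that pings progress during any long silent phase. A
sketch only -- the Progressable interface below is a local stand-in for
org.apache.hadoop.util.Progressable (which the task's Reporter implements),
and whether the pings cover the post-close flush depends on when the child
JVM's reporting thread stops, so this is untested against 0.20:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Local stand-in for org.apache.hadoop.util.Progressable; real code would
// pass the task's Reporter here instead.
interface Progressable {
    void progress();
}

public class KeepAlive implements AutoCloseable {
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "keep-alive");
            t.setDaemon(true); // never keep the JVM alive on its own
            return t;
        });

    public KeepAlive(Progressable reporter, long periodMillis) {
        // Ping often enough to stay well under mapred.task.timeout.
        timer.scheduleAtFixedRate(reporter::progress, periodMillis,
                                  periodMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {
        timer.shutdownNow();
    }
}
```

In a mapper you would construct this around the long-running section
(passing the Reporter, with a period of, say, 60 seconds) and close it when
the work is done.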
-----Original Message-----
From: Mathias De Maré [mailto:mathias.demare@gmail.com]
Sent: Wednesday, August 05, 2009 12:33 PM
To: common-user@hadoop.apache.org
Subject: Some tasks fail to report status between the end of the map and the beginning of the merge
Hi,
I'm having some problems (Hadoop 0.20.0) where map tasks fail to report
status for 10 minutes and eventually get killed. All of the tasks output
around the same amount of data; some take only a few seconds before starting
the 'merge' on the segments, but some seem to fail by simply stopping work
for about 10 minutes.
Several of these failed tasks eventually do succeed, on their 2nd or 3rd
attempt. It's nice to see them succeed eventually, but each of those tasks
crawls a few thousand websites, and it seems like a terrible waste to let
them retry several times, re-downloading all of those websites after 10
minutes of doing nothing.
Worse, eventually one of the tasks fails completely, which then kills the
entire job.
I could probably increase the number of task attempts and simply hope the
tasks will succeed eventually, but that doesn't solve the huge slowdowns and
the recrawling required.
Here's an example of a successful task attempt (this is attempt 3 of a
specific task -- note that around 8 minutes pass between spill 133 and
spill 134):
2009-08-04 18:38:48,059 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 129
2009-08-04 18:39:00,626 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2009-08-04 18:39:00,626 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 2492163; bufend = 628913; bufvoid = 2988446
2009-08-04 18:39:00,626 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 9727; kvend = 7760; length = 9830
2009-08-04 18:39:01,467 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 130
2009-08-04 18:39:08,136 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2009-08-04 18:39:08,136 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 628913; bufend = 1880222; bufvoid = 2988448
2009-08-04 18:39:08,136 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 7760; kvend = 5793; length = 9830
2009-08-04 18:39:08,463 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 131
2009-08-04 18:39:12,456 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2009-08-04 18:39:12,459 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 1880222; bufend = 136018; bufvoid = 2988448
2009-08-04 18:39:12,459 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 5793; kvend = 3826; length = 9830
2009-08-04 18:39:12,697 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 132
2009-08-04 18:39:23,138 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2009-08-04 18:39:23,138 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 136018; bufend = 1347353; bufvoid = 2988448
2009-08-04 18:39:23,138 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 3826; kvend = 1859; length = 9830
2009-08-04 18:39:25,747 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 133
2009-08-04 18:47:49,823 INFO org.apache.hadoop.mapred.MapTask:
Starting flush of map output
2009-08-04 18:47:50,132 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 134
2009-08-04 18:47:50,525 INFO org.apache.hadoop.mapred.Merger: Merging
135 sorted segments
2009-08-04 18:47:50,528 INFO org.apache.hadoop.mapred.Merger: Merging
9 intermediate segments out of a total of 135
2009-08-04 18:47:52,224 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 127
2009-08-04 18:47:53,837 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 118
2009-08-04 18:47:55,417 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 109
2009-08-04 18:47:56,990 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 100
2009-08-04 18:47:58,492 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 91
2009-08-04 18:48:00,191 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 82
2009-08-04 18:48:02,315 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 73
2009-08-04 18:48:04,184 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 64
2009-08-04 18:48:06,162 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 55
2009-08-04 18:48:08,149 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 46
2009-08-04 18:48:09,888 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 37
2009-08-04 18:48:11,744 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 28
2009-08-04 18:48:13,544 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 19
2009-08-04 18:48:21,177 INFO org.apache.hadoop.mapred.Merger: Down to
the last merge-pass, with 10 segments left of total size: 153426836
bytes
2009-08-04 18:48:46,226 INFO org.apache.hadoop.mapred.TaskRunner:
Task:attempt_200908041626_0001_m_000133_2 is done. And is in the
process of commiting
2009-08-04 18:48:46,373 INFO org.apache.hadoop.mapred.TaskRunner: Task
'attempt_200908041626_0001_m_000133_2' done.
And here is the same task in an earlier attempt (it only goes up to spill
132, though spill sizes do vary slightly, right?):
2009-08-04 17:15:49,112 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 321905; bufend = 1464691; bufvoid = 2988448
2009-08-04 17:15:49,112 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 7764; kvend = 5797; length = 9830
2009-08-04 17:15:49,910 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 126
2009-08-04 17:16:00,785 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2009-08-04 17:16:00,785 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 1464691; bufend = 2580856; bufvoid = 2988448
2009-08-04 17:16:00,785 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 5797; kvend = 3830; length = 9830
2009-08-04 17:16:05,728 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 127
2009-08-04 17:16:20,883 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2009-08-04 17:16:20,883 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 2580856; bufend = 1000978; bufvoid = 2988448
2009-08-04 17:16:20,883 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 3830; kvend = 1863; length = 9830
2009-08-04 17:16:21,709 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 128
2009-08-04 17:16:24,461 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2009-08-04 17:16:24,461 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 1000978; bufend = 2246166; bufvoid = 2988448
2009-08-04 17:16:24,461 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 1863; kvend = 9727; length = 9830
2009-08-04 17:16:24,934 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 129
2009-08-04 17:16:40,853 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2009-08-04 17:16:40,853 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 2246166; bufend = 374750; bufvoid = 2988435
2009-08-04 17:16:40,853 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 9727; kvend = 7760; length = 9830
2009-08-04 17:16:43,740 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 130
2009-08-04 17:16:49,141 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2009-08-04 17:16:49,141 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 374750; bufend = 1653302; bufvoid = 2988448
2009-08-04 17:16:49,141 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 7760; kvend = 5793; length = 9830
2009-08-04 17:16:49,868 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 131
2009-08-04 17:17:04,902 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2009-08-04 17:17:04,902 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 1653302; bufend = 2882491; bufvoid = 2988448
2009-08-04 17:17:04,902 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 5793; kvend = 3826; length = 9830
2009-08-04 17:17:05,355 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 132
There's no more output after that last spill, since the task eventually got
killed after 10 minutes of nothing.
And in case it helps, here's the TaskTracker's thread dump from when one of
those failed tasks was killed:
2009-08-04 18:53:23,460 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_200908041626_0001_m_000135_3: Task
attempt_200908041626_0001_m_000135_3 failed to report status for 602
seconds. Killing!
2009-08-04 18:53:23,470 INFO org.apache.hadoop.mapred.TaskTracker: Process
Thread Dump: lost task
31 active threads
Thread 1673 (process reaper):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
java.lang.UNIXProcess.waitForProcessExit(Native Method)
java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
Thread 1672 (JVM Runner jvm_200908041626_0001_m_1938204689 spawned.):
State: WAITING
Blocked count: 1
Waited count: 2
Waiting on java.lang.UNIXProcess@155d434
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
org.apache.hadoop.util.Shell.run(Shell.java:134)
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
Thread 1671 (Thread-1040):
State: WAITING
Blocked count: 0
Waited count: 1
Waiting on java.lang.Object@1ba4552
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
Thread 1646 (process reaper):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
java.lang.UNIXProcess.waitForProcessExit(Native Method)
java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
Thread 1645 (JVM Runner jvm_200908041626_0001_m_2100332869 spawned.):
State: WAITING
Blocked count: 1
Waited count: 2
Waiting on java.lang.UNIXProcess@5e34ce
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
org.apache.hadoop.util.Shell.run(Shell.java:134)
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
Thread 1644 (Thread-1023):
State: WAITING
Blocked count: 0
Waited count: 1
Waiting on java.lang.Object@1bee6f
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
Thread 143 (process reaper):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
java.lang.UNIXProcess.waitForProcessExit(Native Method)
java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
Thread 141 (JVM Runner jvm_200908041626_0001_r_-836397802 spawned.):
State: WAITING
Blocked count: 1
Waited count: 2
Waiting on java.lang.UNIXProcess@11fceed
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
org.apache.hadoop.util.Shell.run(Shell.java:134)
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
Thread 140 (Thread-89):
State: WAITING
Blocked count: 0
Waited count: 1
Waiting on java.lang.Object@1c63791
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
Thread 33 (IPC Client (47) connection to localhost/127.0.0.1:9001 from
root):
State: TIMED_WAITING
Blocked count: 5563
Waited count: 5552
Stack:
java.lang.Object.wait(Native Method)
org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:403)
org.apache.hadoop.ipc.Client$Connection.run(Client.java:445)
Thread 32 (Directory/File cleanup thread):
State: WAITING
Blocked count: 0
Waited count: 165
Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@fdffb1
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.mapred.CleanupQueue$PathCleanupThread.run(CleanupQueue.java:89)
Thread 8 (taskCleanup):
State: WAITING
Blocked count: 0
Waited count: 2
Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@19c205b
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:352)
java.lang.Thread.run(Thread.java:619)
Thread 31 (TaskLauncher for task):
State: WAITING
Blocked count: 4
Waited count: 5
Waiting on java.util.LinkedList@1d9d565
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1662)
Thread 30 (TaskLauncher for task):
State: WAITING
Blocked count: 783
Waited count: 781
Waiting on java.util.LinkedList@1fbebee
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1662)
Thread 29 (Map-events fetcher for all reduce tasks on
tracker_ip-10-244-145-237.ec2.internal:localhost/127.0.0.1:40135):
State: TIMED_WAITING
Blocked count: 10965
Waited count: 10957
Stack:
java.lang.Object.wait(Native Method)
org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread.run(TaskTracker.java:671)
Thread 27 (IPC Server handler 7 on 40135):
State: WAITING
Blocked count: 5
Waited count: 2565
Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@17da690
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 26 (IPC Server handler 6 on 40135):
State: WAITING
Blocked count: 11
Waited count: 2565
Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@17da690
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 25 (IPC Server handler 5 on 40135):
State: WAITING
Blocked count: 11
Waited count: 2566
Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@17da690
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 24 (IPC Server handler 4 on 40135):
State: WAITING
Blocked count: 7
Waited count: 2565
Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@17da690
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 23 (IPC Server handler 3 on 40135):
State: WAITING
Blocked count: 10
Waited count: 2565
Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@17da690
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 22 (IPC Server handler 2 on 40135):
State: WAITING
Blocked count: 11
Waited count: 2566
Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@17da690
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 21 (IPC Server handler 1 on 40135):
State: WAITING
Blocked count: 15
Waited count: 2567
Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@17da690
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 20 (IPC Server handler 0 on 40135):
State: WAITING
Blocked count: 12
Waited count: 2565
Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@17da690
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 17 (IPC Server listener on 40135):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
org.apache.hadoop.ipc.Server$Listener.run(Server.java:318)
Thread 19 (IPC Server Responder):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
org.apache.hadoop.ipc.Server$Responder.run(Server.java:478)
Thread 16 (Timer-0):
State: TIMED_WAITING
Blocked count: 1
Waited count: 298
Stack:
java.lang.Object.wait(Native Method)
java.util.TimerThread.mainLoop(Timer.java:509)
java.util.TimerThread.run(Timer.java:462)
Thread 15 (11502424@qtp0-0 - Acceptor0 SelectChannelConnector@0.0.0.0:50060
):
State: RUNNABLE
Blocked count: 50
Waited count: 1
Stack:
sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:429)
org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:185)
org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:707)
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
Thread 4 (Signal Dispatcher):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
Thread 3 (Finalizer):
State: WAITING
Blocked count: 424
Waited count: 425
Waiting on java.lang.ref.ReferenceQueue$Lock@7b6d1c
Stack:
java.lang.Object.wait(Native Method)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
Thread 2 (Reference Handler):
State: WAITING
Blocked count: 445
Waited count: 446
Waiting on java.lang.ref.Reference$Lock@16ee240
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
Thread 1 (main):
State: RUNNABLE
Blocked count: 2973
Waited count: 5789
Stack:
sun.management.ThreadImpl.getThreadInfo0(Native Method)
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:149)
org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:203)
org.apache.hadoop.mapred.TaskTracker.markUnresponsiveTasks(TaskTracker.java:1383)
org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1129)
org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1779)
org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2881)
2009-08-04 18:53:23,471 INFO org.apache.hadoop.mapred.TaskTracker: About to
purge task: attempt_200908041626_0001_m_000135_3
2009-08-04 18:53:23,471 INFO org.apache.hadoop.mapred.TaskTracker:
addFreeSlot : current free slots : 3
2009-08-04 18:53:23,471 INFO org.apache.hadoop.mapred.TaskRunner:
attempt_200908041626_0001_m_000135_3 done; removing files.
2009-08-04 18:53:23,550 INFO org.apache.hadoop.mapred.IndexCache: Map ID
attempt_200908041626_0001_m_000135_3 not found in cache
Thanks in advance for any response!
I'm afraid I'm somewhat clueless about this.
Mathias