Posted to mapreduce-user@hadoop.apache.org by Arun C Murthy <ac...@yahoo-inc.com> on 2011/02/01 06:36:01 UTC

Re: What are the conditions or which is the status of re-scheduling feature for failed attempts caused by dying a node?

Please don't cross-post, CDH questions should go to their user lists.

On Jan 31, 2011, at 6:15 AM, Kiss Tibor wrote:

> Hi!
>
> I was running a Hadoop cluster on Amazon EC2 instances. After 2 days of
> work, one of the worker nodes simply died (I cannot connect to the
> instance either). That node also appears on the dfshealth page as a dead
> node.
> Up to this point everything is normal.
>
> Unfortunately the job it was running didn't survive. The cluster had 8
> worker nodes, each with 4 mappers and 2 reducers. The job in question had
> ~1200 map tasks and 10 reduce tasks.
> One of the nodes died and I see around 31 failed attempts in the
> jobtracker log. The log is very similar to the one somebody posted here:
> http://pastie.org/pastes/1270614
>
> Some of the attempts (but not all!) have been retried and I saw at
> least two of them which finally reached a successful state.
> The following two lines appear several times in my jobtracker log:
> 2011-01-29 15:50:34,956 WARN org.apache.hadoop.mapred.JobInProgress:  
> Running list for reducers missing!! Job details are missing.
> 2011-01-29 15:50:34,956 WARN org.apache.hadoop.mapred.JobInProgress:  
> Failed cache for reducers missing!! Job details are missing.
>
> This pair of log lines could be the signal that the job couldn't
> be finished by re-scheduling the failed attempts.
> I have seen nothing special in the namenode logs.
>
> Of course I reran the failed job and it finished successfully. But my
> problem is that I would like to understand the failover conditions.
> What could be lost, and which part of Hadoop is not fault tolerant
> in this sense, such that those warnings mentioned earlier appear?
> Is there a way to control such situations?
>
> I am using CDH3b3, so it is a development version of Hadoop.
> Does anybody know about a specific bug or fix which could solve the
> problem in the near future?
>
> Regards
> Tibor Kiss
>
>
>


Re: What are the conditions or which is the status of re-scheduling feature for failed attempts caused by dying a node?

Posted by Koji Noguchi <kn...@yahoo-inc.com>.
Hi Tibor,

I'm afraid I don't have time to go over the logs you've attached, but

>  just the node failure interfered with another problem, which may be marked by
> "User-directories for the user ec2-user are already initialized on this TT. Not
> doing anything."
>
This log was marked as INFO.  Usually that means this is expected and
nothing to worry about.
The log is saying this user has run some tasks before on this TaskTracker
node and it already has the temp user/job dir set up, so no additional work
is needed (for the initializeUserDirs method).

BTW, setting a long timeout for tasks is always a bad idea.
It would be good if you could modify your application so that it reports
progress, and set the timeout back to something close to 10 minutes.
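
A minimal sketch of what that can look like with the old (mapred.*) API this
Hadoop version uses; the class name and the per-chunk work are hypothetical,
only reporter.progress()/setStatus() are the relevant calls:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical long-running mapper: it pings the TaskTracker between
// expensive steps so the default ~10 minute task timeout never trips.
public class LongRunningMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  @Override
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    for (int chunk = 0; chunk < 100; chunk++) {
      // ... one expensive slice of the computation goes here ...
      reporter.progress();                             // resets the timeout clock
      reporter.setStatus("processed chunk " + chunk);  // optional: visible in the web UI
    }
    output.collect(new Text("done"), value);
  }
}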

Koji



On 2/1/11 2:07 PM, "Kiss Tibor" <ki...@gmail.com> wrote:

> Thank you Koji,
>
> I don't think that this case is specific to CDH. (I usually run both
> versions of clusters (Apache and CDH) using Whirr.)
>
> I think I extracted the most interesting part of the events.
> I identified the first task which failed 4 times. (Now I noticed that failing
> 4 times is the precondition for failing the job too. Am I correct?)
> Then I saved the most interesting logs from the tasktracker node where it
> failed the first time and the logs from the namenode+jobtracker instance.
>
> In short, here is what happened with that task.
> 2011-01-28 13:45:03,741   was the job initialization time
> ....
> Those four attempts were scheduled to the following trackers:
> 2011-01-28 19:46:15,153
> tracker_ip-10-116-247-252.ec2.internal:localhost/127.0.0.1:54196
> 2011-01-29 00:46:51,376
> tracker_ip-10-116-186-223.ec2.internal:localhost/127.0.0.1:37193
> 2011-01-29 05:47:55,312
> tracker_ip-10-114-209-180.ec2.internal:localhost/127.0.0.1:52557
> 2011-01-29 10:49:35,833
> tracker_ip-10-100-205-180.ec2.internal:localhost/127.0.0.1:45500
> The tracker_ip-10-100-205-180.ec2.internal instance (where the last two
> attempts were scheduled) died in such a way that I couldn't connect to it
> with ssh anymore.
>
> The long interval between retries is due to an unusually large timeout we
> set. (We are running some statistical algorithms on a relatively large split
> size of biological data and the running time for a task is heavily dependent
> on the content; that's why we chose to do some measurements to estimate the
> optimal split size and running time distribution. We noticed that 30 minutes
> wasn't enough for some tasks, therefore we increased the timeout to a value
> that all the tasks would fit in, deciding later how to reduce the split size
> so that the running time is efficient.)
>
> After the failed node was marked dead, I reran the job with the same data
> and the job finished in approx. 2 hours.
> (The algorithm is repeatable; it does not have randomness that would produce
> different running times when rerunning.)
> It looks like before the node died we had a severe slowdown. The real cause
> of the slowdown may be the EC2 instance which finally died; unfortunately
> we didn't monitor in enough detail to catch such a slowdown. Is it possible
> for such a slowdown on one node to have such a huge impact on the entire job
> running on 8 workers?
>
> Looking into the tasktracker log for the first attempt of the failed task,
> searching for the cause of the failure, I see two different things.
> One of them is at the beginning:
> ----- hadoop-hadoop-tasktracker-ip-10-116-247-252.log.2011-01-28  -----
> 2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.TaskTracker:
> LaunchTaskAction (registerTask): attempt_201101261418_0021_m_000193_0 task's
> state:UNASSIGNED
> 2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.TaskTracker: Trying to
> launch : attempt_201101261418_0021_m_000193_0 which needs 1 slots
> 2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.TaskTracker: In
> TaskLauncher, current free slots : 1 and trying to launch
> attempt_201101261418_0021_m_000193_0 which needs 1 slots
> 2011-01-28 19:46:15,153 INFO
> org.apache.hadoop.mapreduce.server.tasktracker.Localizer: User-directories for
> the user ec2-user are already initialized on this TT. Not doing anything.
> 2011-01-28 19:46:15,169 INFO org.apache.hadoop.mapred.JvmManager: Killing JVM:
> jvm_201101261418_0021_m_456866840
>   (Note that I see the same JVM-killing message several times with different
> attempts.)
> Then at the timeout:
> ----- hadoop-hadoop-tasktracker-ip-10-116-247-252.log.2011-01-28  -----
> 2011-01-29 00:46:31,011 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201101261418_0021_m_000193_0: Task
> attempt_201101261418_0021_m_000193_0 failed to report status for 18002
> seconds. Killing!
> 2011-01-29 00:46:31,039 INFO org.apache.hadoop.mapred.TaskTracker: Process
> Thread Dump: lost task
> 43 active threads..... (here I have a long stacktrace, see below)
>
> Very interesting is the "User-directories for the user ec2-user are already
> initialized on this TT. Not doing anything." message. After doing nothing, a
> timeout perhaps makes some sense, even if it happened at a 5 hour limit.
>
> Another thing is that an attempt may happen to be rescheduled a 4th time onto
> a failing node. In that case the job failover mechanism cannot be effective
> anymore. At least I know that the job failover mechanism itself is not buggy;
> the node failure just interfered with another problem, which may be marked by
> "User-directories for the user ec2-user are already initialized on this TT.
> Not doing anything."
>
>
> Here are the log extracts in detail:
> ------------------jobtracker log----------------------------
> 2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.JobInProgress: Choosing
> a non-local task task_201101261418_0021_m_000193
> 2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.JobTracker: Adding task
> (MAP) 'attempt_201101261418_0021_m_000193_0' to tip
> task_201101261418_0021_m_000193, for tracker
> 'tracker_ip-10-116-247-252.ec2.internal:localhost/127.0.0.1:54196
> <http://127.0.0.1:54196> '
>
> ----- hadoop-hadoop-tasktracker-ip-10-116-247-252.log.2011-01-28  -----
> 2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.TaskTracker:
> LaunchTaskAction (registerTask): attempt_201101261418_0021_m_000193_0 task's
> state:UNASSIGNED
> 2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.TaskTracker: Trying to
> launch : attempt_201101261418_0021_m_000193_0 which needs 1 slots
> 2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.TaskTracker: In
> TaskLauncher, current free slots : 1 and trying to launch
> attempt_201101261418_0021_m_000193_0 which needs 1 slots
> 2011-01-28 19:46:15,153 INFO
> org.apache.hadoop.mapreduce.server.tasktracker.Localizer: User-directories for
> the user ec2-user are already initialized on this TT. Not doing anything.
> 2011-01-28 19:46:15,169 INFO org.apache.hadoop.mapred.JvmManager: Killing JVM:
> jvm_201101261418_0021_m_456866840
>   (Note that I see the same JVM-killing message several times with different
> attempts.)
>
>
> ------------------jobtracker log----------------------------
> 2011-01-29 00:46:36,306 INFO org.apache.hadoop.mapred.TaskInProgress: Error
> from attempt_201101261418_0021_m_000193_0: Task
> attempt_201101261418_0021_m_000193_0 failed to report status for 18002 seconds. Killing!
> 2011-01-29 00:46:36,306 INFO org.apache.hadoop.mapred.JobTracker: Removing
> task 'attempt_201101261418_0021_m_000193_0'
> 2011-01-29 00:46:36,306 INFO org.apache.hadoop.mapred.TaskInProgress: Error
> from attempt_201101261418_0021_m_000193_0: Task
> attempt_201101261418_0021_m_000193_0 failed to report status for 18002 seconds. Killing!
> 2011-01-29 00:46:36,306 INFO org.apache.hadoop.mapred.JobTracker: Removing
> task 'attempt_201101261418_0021_m_000193_0'
> 2011-01-29 00:46:36,307 INFO org.apache.hadoop.mapred.JobTracker: Adding task
> (TASK_CLEANUP) 'attempt_201101261418_0021_m_000193_0' to tip
> task_201101261418_0021_m_000193, for tracker
> 'tracker_ip-10-116-247-252.ec2.internal:localhost/127.0.0.1:54196
> <http://127.0.0.1:54196> '
> 2011-01-29 00:46:51,376 INFO org.apache.hadoop.mapred.JobInProgress: Choosing
> a non-local task task_201101261418_0021_m_000193
> 2011-01-29 00:46:51,376 INFO org.apache.hadoop.mapred.JobTracker: Adding task
> (MAP) 'attempt_201101261418_0021_m_000193_1' to tip
> task_201101261418_0021_m_000193, for tracker
> 'tracker_ip-10-116-186-223.ec2.internal:localhost/127.0.0.1:37193
> <http://127.0.0.1:37193> '
>
> ----- hadoop-hadoop-tasktracker-ip-10-116-247-252.log.2011-01-28  -----
> 2011-01-29 00:46:31,011 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201101261418_0021_m_000193_0: Task
> attempt_201101261418_0021_m_000193_0 failed to report status for 18002
> seconds. Killing!
> 2011-01-29 00:46:31,039 INFO org.apache.hadoop.mapred.TaskTracker: Process
> Thread Dump: lost task
> 43 active threads
> Thread 17976 (process reaper):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
>     java.lang.UNIXProcess.waitForProcessExit(Native Method)
>     java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>     java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> Thread 17975 (JVM Runner jvm_201101261418_0021_m_-646088387 spawned.):
>   State: WAITING
>   Blocked count: 1
>   Waited count: 2
>   Waiting on java.lang.UNIXProcess@19570aa2
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>     org.apache.hadoop.util.Shell.runCommand(Shell.java:245)
>     org.apache.hadoop.util.Shell.run(Shell.java:177)
>     org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:376)
>
> org.apache.hadoop.mapred.DefaultTaskController.launchTaskJVM(DefaultTaskContro
> ller.java:70)
>
> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmMa
> nager.java:450)
>
> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager
> .java:439)
> Thread 17947 (Thread-9955):
>   State: WAITING
>   Blocked count: 4
>   Waited count: 6
>   Waiting on java.lang.Object@217b1ead
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     org.apache.hadoop.mapred.TaskRunner.launchJvmAndWait(TaskRunner.java:260)
>     org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:208)
> Thread 17924 (process reaper):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
>     java.lang.UNIXProcess.waitForProcessExit(Native Method)
>     java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>     java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> Thread 17922 (JVM Runner jvm_201101261418_0021_m_913485741 spawned.):
>   State: WAITING
>   Blocked count: 1
>   Waited count: 2
>   Waiting on java.lang.UNIXProcess@118b3420
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>     org.apache.hadoop.util.Shell.runCommand(Shell.java:245)
>     org.apache.hadoop.util.Shell.run(Shell.java:177)
>     org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:376)
>
> org.apache.hadoop.mapred.DefaultTaskController.launchTaskJVM(DefaultTaskContro
> ller.java:70)
>
> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmMa
> nager.java:450)
>
> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager
> .java:439)
> Thread 17906 (Thread-9934):
>   State: WAITING
>   Blocked count: 1
>   Waited count: 1
>   Waiting on java.lang.Object@4a0b29df
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     org.apache.hadoop.mapred.TaskRunner.launchJvmAndWait(TaskRunner.java:260)
>     org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:208)
> Thread 17889 (process reaper):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
>     java.lang.UNIXProcess.waitForProcessExit(Native Method)
>     java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>     java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> Thread 17888 (JVM Runner jvm_201101261418_0021_m_-2031309830 spawned.):
>   State: WAITING
>   Blocked count: 1
>   Waited count: 2
>   Waiting on java.lang.UNIXProcess@762b9c2d
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>     org.apache.hadoop.util.Shell.runCommand(Shell.java:245)
>     org.apache.hadoop.util.Shell.run(Shell.java:177)
>     org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:376)
>
> org.apache.hadoop.mapred.DefaultTaskController.launchTaskJVM(DefaultTaskContro
> ller.java:70)
>
> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmMa
> nager.java:450)
>
> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager
> .java:439)
> Thread 17871 (Thread-9914):
>   State: WAITING
>   Blocked count: 1
>   Waited count: 1
>   Waiting on java.lang.Object@291a8af7
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     org.apache.hadoop.mapred.TaskRunner.launchJvmAndWait(TaskRunner.java:260)
>     org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:208)
> Thread 17283 (process reaper):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
>     java.lang.UNIXProcess.waitForProcessExit(Native Method)
>     java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>     java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> Thread 17282 (JVM Runner jvm_201101261418_0021_m_811927398 spawned.):
>   State: WAITING
>   Blocked count: 1
>   Waited count: 2
>   Waiting on java.lang.UNIXProcess@5ff6c2a3
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>     org.apache.hadoop.util.Shell.runCommand(Shell.java:245)
>     org.apache.hadoop.util.Shell.run(Shell.java:177)
>     org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:376)
>
> org.apache.hadoop.mapred.DefaultTaskController.launchTaskJVM(DefaultTaskContro
> ller.java:70)
>
> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmMa
> nager.java:450)
>
> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager
> .java:439)
> Thread 17248 (Thread-9561):
>   State: WAITING
>   Blocked count: 1
>   Waited count: 1
>   Waiting on java.lang.Object@41648c8d
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     org.apache.hadoop.mapred.TaskRunner.launchJvmAndWait(TaskRunner.java:260)
>     org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:208)
> Thread 16493 (process reaper):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
>     java.lang.UNIXProcess.waitForProcessExit(Native Method)
>     java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>     java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> Thread 16492 (JVM Runner jvm_201101261418_0021_r_1210798394 spawned.):
>   State: WAITING
>   Blocked count: 1
>   Waited count: 2
>   Waiting on java.lang.UNIXProcess@7994641d
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>     org.apache.hadoop.util.Shell.runCommand(Shell.java:245)
>     org.apache.hadoop.util.Shell.run(Shell.java:177)
>     org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:376)
>
> org.apache.hadoop.mapred.DefaultTaskController.launchTaskJVM(DefaultTaskContro
> ller.java:70)
>
> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmMa
> nager.java:450)
>
> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager
> .java:439)
> Thread 16491 (Thread-9136):
>   State: WAITING
>   Blocked count: 0
>   Waited count: 1
>   Waiting on java.lang.Object@7d38ea46
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     org.apache.hadoop.mapred.TaskRunner.launchJvmAndWait(TaskRunner.java:260)
>     org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:208)
> Thread 7890 (sendParams-3):
>   State: TIMED_WAITING
>   Blocked count: 14
>   Waited count: 50144
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
>
> java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQu
> eue.java:424)
>
> java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.
> java:323)
>     java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:874)
>
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:945)
>
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
>     java.lang.Thread.run(Thread.java:619)
> Thread 346 (IPC Client (47) connection to
> ec2-184-73-96-34.compute-1.amazonaws.com/10.195.194.223:8021
> <http://ec2-184-73-96-34.compute-1.amazonaws.com/10.195.194.223:8021>  from
> mapred):
>   State: TIMED_WAITING
>   Blocked count: 128340
>   Waited count: 127868
>   Stack:
>     java.lang.Object.wait(Native Method)
>     org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:645)
>     org.apache.hadoop.ipc.Client$Connection.run(Client.java:688)
> Thread 10 (taskCleanup):
>   State: WAITING
>   Blocked count: 24
>   Waited count: 32
>   Waiting on
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@4d480ea
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Ab
> stractQueuedSynchronizer.java:1987)
>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>     org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:397)
>     java.lang.Thread.run(Thread.java:619)
> Thread 16 (Thread-7):
>   State: WAITING
>   Blocked count: 2352
>   Waited count: 2782
>   Waiting on
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1b5b8520
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Ab
> stractQueuedSynchronizer.java:1987)
>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>
> org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.monitor
> (UserLogManager.java:79)
>
> org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager$1.run(U
> serLogManager.java:46)
> Thread 17 (Thread-8):
>   State: TIMED_WAITING
>   Blocked count: 0
>   Waited count: 58
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.mapred.UserLogCleaner.run(UserLogCleaner.java:92)
> Thread 37 (TaskLauncher for REDUCE tasks):
>   State: WAITING
>   Blocked count: 23
>   Waited count: 36
>   Waiting on java.util.LinkedList@158105e8
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>
> org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2119)
> Thread 36 (TaskLauncher for MAP tasks):
>   State: WAITING
>   Blocked count: 502
>   Waited count: 555
>   Waiting on java.util.LinkedList@49aacd5f
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>
> org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2119)
> Thread 33 (Map-events fetcher for all reduce tasks on
> tracker_ip-10-116-247-252.ec2.internal:localhost/127.0.0.1:54196
> <http://127.0.0.1:54196> ):
>   State: TIMED_WAITING
>   Blocked count: 316394
>   Waited count: 377686
>   Stack:
>     java.lang.Object.wait(Native Method)
>
> org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread.run(TaskTracker.ja
> va:859)
> Thread 30 (IPC Server handler 7 on 54196):
>   State: WAITING
>   Blocked count: 1140
>   Waited count: 77439
>   Waiting on
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Ab
> stractQueuedSynchronizer.java:1987)
>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>     org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
> Thread 29 (IPC Server handler 6 on 54196):
>   State: WAITING
>   Blocked count: 1134
>   Waited count: 77434
>   Waiting on
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Ab
> stractQueuedSynchronizer.java:1987)
>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>     org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
> Thread 28 (IPC Server handler 5 on 54196):
>   State: WAITING
>   Blocked count: 1165
>   Waited count: 77436
>   Waiting on
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Ab
> stractQueuedSynchronizer.java:1987)
>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>     org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
> Thread 27 (IPC Server handler 4 on 54196):
>   State: WAITING
>   Blocked count: 1131
>   Waited count: 77435
>   Waiting on
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Ab
> stractQueuedSynchronizer.java:1987)
>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>     org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
> Thread 26 (IPC Server handler 3 on 54196):
>   State: WAITING
>   Blocked count: 1015
>   Waited count: 77443
>   Waiting on
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Ab
> stractQueuedSynchronizer.java:1987)
>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>     org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
> Thread 25 (IPC Server handler 2 on 54196):
>   State: WAITING
>   Blocked count: 1027
>   Waited count: 77437
>   Waiting on
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Ab
> stractQueuedSynchronizer.java:1987)
>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>     org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
> Thread 24 (IPC Server handler 1 on 54196):
>   State: WAITING
>   Blocked count: 1083
>   Waited count: 77444
>   Waiting on
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Ab
> stractQueuedSynchronizer.java:1987)
>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>     org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
> Thread 23 (IPC Server handler 0 on 54196):
>   State: WAITING
>   Blocked count: 1018
>   Waited count: 77435
>   Waiting on
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Ab
> stractQueuedSynchronizer.java:1987)
>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>     org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
> Thread 20 (IPC Server listener on 54196):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 1
>   Stack:
>     sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>     sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
>     sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>     sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
>     org.apache.hadoop.ipc.Server$Listener.run(Server.java:379)
> Thread 22 (IPC Server Responder):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
>     sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>     sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
>     sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>     sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>     org.apache.hadoop.ipc.Server$Responder.run(Server.java:539)
> Thread 21 (Timer thread for monitoring rpc):
>   State: TIMED_WAITING
>   Blocked count: 0
>   Waited count: 41380
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.util.TimerThread.mainLoop(Timer.java:509)
>     java.util.TimerThread.run(Timer.java:462)
> Thread 19 (Timer thread for monitoring jvm):
>   State: TIMED_WAITING
>   Blocked count: 0
>   Waited count: 41380
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.util.TimerThread.mainLoop(Timer.java:509)
>     java.util.TimerThread.run(Timer.java:462)
> Thread 18 (Directory/File cleanup thread):
>   State: WAITING
>   Blocked count: 1593
>   Waited count: 2038
>   Waiting on
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@44aea710
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Ab
> stractQueuedSynchronizer.java:1987)
>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>
> org.apache.hadoop.mapred.CleanupQueue$PathCleanupThread.run(CleanupQueue.java:
> 130)
> Thread 15 (Timer-0):
>   State: TIMED_WAITING
>   Blocked count: 1
>   Waited count: 6898
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.util.TimerThread.mainLoop(Timer.java:509)
>     java.util.TimerThread.run(Timer.java:462)
> Thread 14 (1720982966@qtp0-0 - Acceptor0 SelectChannelConnector@0.0.0.0:50060
> <http://SelectChannelConnector@0.0.0.0:50060> ):
>   State: RUNNABLE
>   Blocked count: 537
>   Waited count: 43
>   Stack:
>     sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>     sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
>     sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>     sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>
>
> org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:429)
>     org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:185)
>
> org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.jav
> a:124)
>
> org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:707)
>
> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
> Thread 12 (Timer thread for monitoring mapred):
>   State: TIMED_WAITING
>   Blocked count: 1
>   Waited count: 41382
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.util.TimerThread.mainLoop(Timer.java:509)
>     java.util.TimerThread.run(Timer.java:462)
> Thread 4 (Signal Dispatcher):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
> Thread 3 (Finalizer):
>   State: WAITING
>   Blocked count: 1235
>   Waited count: 1226
>   Waiting on java.lang.ref.ReferenceQueue$Lock@309fe84e
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
>     java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
>     java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
> Thread 2 (Reference Handler):
>   State: WAITING
>   Blocked count: 1344
>   Waited count: 1340
>   Waiting on java.lang.ref.Reference$Lock@524c71d2
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
> Thread 1 (main):
>   State: RUNNABLE
>   Blocked count: 68838
>   Waited count: 205783
>   Stack:
>     sun.management.ThreadImpl.getThreadInfo0(Native Method)
>     sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
>     sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
>
> org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:14
> 9)
>
> org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:203)
>
> org.apache.hadoop.mapred.TaskTracker.markUnresponsiveTasks(TaskTracker.java:17
> 71)
>     org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1538)
>     org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:2280)
>     org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3504)
>
> 2011-01-29 00:46:31,039 INFO org.apache.hadoop.mapred.TaskTracker: About to
> purge task: attempt_201101261418_0021_m_000193_0
>
> ----------------------------------------------
> 2011-01-29 05:47:09,730 INFO org.apache.hadoop.mapred.TaskInProgress: Error
> from attempt_201101261418_0021_m_000193_1: Task
> attempt_201101261418_0021_m_000193_1 failed to report status for 18001 seconds. Killing!
> 2011-01-29 05:47:09,731 INFO org.apache.hadoop.mapred.JobTracker: Removing
> task 'attempt_201101261418_0021_m_000193_1'
> 2011-01-29 05:47:09,731 INFO org.apache.hadoop.mapred.JobTracker: Adding task
> (TASK_CLEANUP) 'attempt_201101261418_0021_m_000193_1' to tip
> task_201101261418_0021_m_000193, for tracker
> 'tracker_ip-10-116-186-223.ec2.internal:localhost/127.0.0.1:37193
> <http://127.0.0.1:37193> '
> 2011-01-29 05:47:12,891 INFO org.apache.hadoop.mapred.JobInProgress: Choosing
> a non-local task task_201101261418_0021_m_000544
> 2011-01-29 05:47:12,891 INFO org.apache.hadoop.mapred.JobTracker: Adding task
> (MAP) 'attempt_201101261418_0021_m_000544_0' to tip
> task_201101261418_0021_m_000544, for tracker
> 'tracker_ip-10-116-186-223.ec2.internal:localhost/127.0.0.1:37193
> <http://127.0.0.1:37193> '
> 2011-01-29 05:47:12,891 INFO org.apache.hadoop.mapred.JobTracker: Removing
> task 'attempt_201101261418_0021_m_000193_1'
> 2011-01-29 05:47:55,312 INFO org.apache.hadoop.mapred.JobInProgress: Task
> 'attempt_201101261418_0021_m_000517_0' has completed
> task_201101261418_0021_m_000517 successfully.
> 2011-01-29 05:47:55,312 INFO org.apache.hadoop.mapred.JobInProgress: Choosing
> a non-local task task_201101261418_0021_m_000193
> 2011-01-29 05:47:55,312 INFO org.apache.hadoop.mapred.JobTracker: Adding task
> (MAP) 'attempt_201101261418_0021_m_000193_2' to tip
> task_201101261418_0021_m_000193, for tracker
> 'tracker_ip-10-114-209-180.ec2.internal:localhost/127.0.0.1:52557
> <http://127.0.0.1:52557> '
>   (Note that I don't have the corresponding tasktracker log from
> tracker_ip-10-116-186-223.ec2.internal.)
>
> ----------------------------------------------
> 2011-01-29 10:48:12,869 INFO org.apache.hadoop.mapred.TaskInProgress: Error
> from attempt_201101261418_0021_m_000193_2: Task
> attempt_201101261418_0021_m_000193_2 failed to report status for 18000 seconds. Killing!
> 2011-01-29 10:48:12,869 INFO org.apache.hadoop.mapred.JobTracker: Removing
> task 'attempt_201101261418_0021_m_000193_2'
> 2011-01-29 10:48:12,869 INFO org.apache.hadoop.mapred.JobTracker: Adding task
> (TASK_CLEANUP) 'attempt_201101261418_0021_m_000193_2' to tip
> task_201101261418_0021_m_000193, for tracker
> 'tracker_ip-10-114-209-180.ec2.internal:localhost/127.0.0.1:52557
> <http://127.0.0.1:52557> '
> 2011-01-29 10:48:15,995 INFO org.apache.hadoop.mapred.JobInProgress: Choosing
> a non-local task task_201101261418_0021_m_000722
> 2011-01-29 10:48:15,996 INFO org.apache.hadoop.mapred.JobTracker: Adding task
> (MAP) 'attempt_201101261418_0021_m_000722_0' to tip
> task_201101261418_0021_m_000722, for tracker
> 'tracker_ip-10-114-209-180.ec2.internal:localhost/127.0.0.1:52557
> <http://127.0.0.1:52557> '
> 2011-01-29 10:48:15,996 INFO org.apache.hadoop.mapred.JobTracker: Removing
> task 'attempt_201101261418_0021_m_000193_2'
> 2011-01-29 10:49:35,832 INFO org.apache.hadoop.mapred.JobInProgress: Task
> 'attempt_201101261418_0021_m_000697_0' has completed
> task_201101261418_0021_m_000697 successfully.
> 2011-01-29 10:49:35,833 INFO org.apache.hadoop.mapred.JobInProgress: Choosing
> a non-local task task_201101261418_0021_m_000193
> 2011-01-29 10:49:35,833 INFO org.apache.hadoop.mapred.JobTracker: Adding task
> (MAP) 'attempt_201101261418_0021_m_000193_3' to tip
> task_201101261418_0021_m_000193, for tracker
> 'tracker_ip-10-100-205-180.ec2.internal:localhost/127.0.0.1:45500
> <http://127.0.0.1:45500> '
>
> ----------------------------------------------
> 2011-01-29 15:49:56,006 INFO org.apache.hadoop.mapred.TaskInProgress: Error
> from attempt_201101261418_0021_m_000193_3: Task
> attempt_201101261418_0021_m_000193_3 failed to report status for 18002 seconds. Killing!
> 2011-01-29 15:49:56,006 INFO org.apache.hadoop.mapred.JobTracker: Removing
> task 'attempt_201101261418_0021_m_000193_3'
> 2011-01-29 15:49:56,007 INFO org.apache.hadoop.mapred.JobTracker: Adding task
> (TASK_CLEANUP) 'attempt_201101261418_0021_m_000193_3' to tip
> task_201101261418_0021_m_000193, for tracker
> 'tracker_ip-10-100-205-180.ec2.internal:localhost/127.0.0.1:45500
> <http://127.0.0.1:45500> '
> 2011-01-29 15:49:59,070 INFO org.apache.hadoop.mapred.TaskInProgress:
> TaskInProgress task_201101261418_0021_m_000193 has failed 4 times.
> 2011-01-29 15:49:59,070 INFO org.apache.hadoop.mapred.JobInProgress: Aborting
> job job_201101261418_0021
> 2011-01-29 15:49:59,070 INFO org.apache.hadoop.mapred.JobInProgress: Killing
> job 'job_201101261418_0021'
> 2011-01-29 15:49:59,072 INFO org.apache.hadoop.mapred.JobTracker: Adding task
> (JOB_CLEANUP) 'attempt_201101261418_0021_m_001294_0' to tip
> task_201101261418_0021_m_001294, for tracker
> 'tracker_ip-10-100-205-180.ec2.internal:localhost/127.0.0.1:45500
> <http://127.0.0.1:45500> '
> 2011-01-29 15:49:59,072 INFO org.apache.hadoop.mapred.JobTracker: Removing
> task 'attempt_201101261418_0021_m_000193_3'
> 2011-01-29 15:50:04,244 INFO org.apache.hadoop.mapred.JobTracker: Removing
> task 'attempt_201101261418_0021_m_000876_0'
> 2011-01-29 15:50:04,900 INFO org.apache.hadoop.mapred.JobTracker: Removing
> task 'attempt_201101261418_0021_m_000781_0'
> 2011-01-29 15:50:05,043 INFO org.apache.hadoop.mapred.JobTracker: Removing
> task 'attempt_201101261418_0021_m_000877_0'
> 2011-01-29 15:50:05,132 INFO org.apache.hadoop.mapred.JobTracker: Removing
> task 'attempt_201101261418_0021_m_000865_0'
> 2011-01-29 15:50:05,736 INFO org.apache.hadoop.mapred.JobTracker: Removing
> task 'attempt_201101261418_0021_m_000875_0'
> 2011-01-29 15:50:05,768 INFO org.apache.hadoop.mapred.JobTracker: Removing
> task 'attempt_201101261418_0021_m_000864_0'
> 2011-01-29 15:50:06,334 INFO org.apache.hadoop.mapred.JobTracker: Removing
> task 'attempt_201101261418_0021_m_000866_0'
> .......
>
> Here is the end of log details:
>
> Thank you for your attention.
>
> Tibor
>
> On Tue, Feb 1, 2011 at 5:33 PM, Koji Noguchi <kn...@yahoo-inc.com> wrote:
>> (Bcc CDH alias)
>>
>>
>>> Please don't cross-post, CDH questions should go to their user lists.
>>>
>> Was this CDH specific?
>>
>> Did the job show up as failed on the jobtracker webui?
>> If yes, can you grep a jobtracker log to see something like
>>
>> 2011-02-01 00:06:41,510 INFO org.apache.hadoop.mapred.TaskInProgress:
>> TaskInProgress task_201101040441_333049_r_000004 has failed 4 times.
>> 2011-02-01 00:06:41,510 INFO org.apache.hadoop.mapred.JobInProgress: Aborting
>> job job_201101040441_333049
>> 2011-02-01 00:06:41,510 INFO org.apache.hadoop.mapred.JobInProgress: Killing
>> job 'job_201101040441_333049'
>>
>> which tells you which task failure caused the job to fail.
>> Then you can look at the userlog of those task attempts to see why they
>> failed.
>>
>> Ideally this info should show up on the webui.
>>
>> On the other hand, if the job just hangs for hours, there's probably a bug in
>> the framework.
>>
>> Koji
>>
>>
>> On 1/31/11 9:36 PM, "Arun C Murthy" <acm@yahoo-inc.com> wrote:
>>
>>> Please don't cross-post, CDH questions should go to their user lists.
>>>
>>> On Jan 31, 2011, at 6:15 AM, Kiss Tibor wrote:
>>>
>>>> Hi!
>>>>
>>>> I was running a Hadoop cluster on Amazon EC2 instances. After 2 days of
>>>> work, one of the worker nodes simply died (I cannot connect to the
>>>> instance either). That node also appears on the dfshealth page as a dead node.
>>>>  Up to this point everything is normal.
>>>>
>>>> Unfortunately the job it was running didn't survive. The cluster had 8
>>>> worker nodes, each with 4 mappers and 2 reducers. The job in question had
>>>> ~1200 map tasks and 10 reduce tasks.
>>>>  One of the nodes died and I see around 31 failed attempts in the jobtracker
>>>> log. The log is very similar to the one somebody posted here:
>>>> http://pastie.org/pastes/1270614
>>>>
>>>> Some of the attempts (but not all!) have been retried and I saw at least two
>>>> of them which finally reached a successful state.
>>>> The following two lines appear several times in my jobtracker log:
>>>> 2011-01-29 15:50:34,956 WARN org.apache.hadoop.mapred.JobInProgress:
>>>> Running list for reducers missing!! Job details are missing.
>>>>  2011-01-29 15:50:34,956 WARN org.apache.hadoop.mapred.JobInProgress:
>>>> Failed cache for reducers missing!! Job details are missing.
>>>>
>>>> This pair of log lines could be the signal that the job couldn't be
>>>> finished by re-scheduling the failed attempts.
>>>>  I have seen nothing special in the namenode logs.
>>>>
>>>> Of course I reran the failed job and it finished successfully. But my
>>>> problem is that I would like to understand the failover conditions. What
>>>> could be lost, and which part of Hadoop is not fault tolerant in this
>>>> sense, such that those warnings mentioned earlier appear? Is there a way
>>>> to control such situations?
>>>>
>>>> I am using CDH3b3, so it is a development version of Hadoop.
>>>> Does anybody know about a specific bug or fix which could solve the
>>>> problem in the near future?
>>>>
>>>> Regards
>>>> Tibor Kiss
>>>>
>>>>
>>>>
>>>
>>>
>
>


Re: What are the conditions or which is the status of re-scheduling feature for failed attempts caused by dying a node?

Posted by Kiss Tibor <ki...@gmail.com>.
Thank you Koji,

I don't think that this case is specific to CDH. (I usually run
both versions of clusters (Apache and CDH) using Whirr.)

I think I extracted the most interesting part of the events.
I identified the first task which failed 4 times. (Now I noticed that failing
4 times is the precondition for failing the job too. Am I correct?)
Then I saved the most interesting logs from the tasktracker node where it
failed the first time and the logs from the namenode+jobtracker instance.
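
(For reference: in this MapReduce version a task is retried up to
mapred.map.max.attempts times, 4 by default, and once one task exhausts its
attempts the whole job is aborted. A hypothetical old-API sketch of the knobs
involved, not taken from the original job:)

import org.apache.hadoop.mapred.JobConf;

// Hypothetical sketch of the retry-related knobs on the old (mapred.*) API.
public class RetryTuning {
  public static void configure(JobConf conf) {
    // Attempts allowed per task before the task is declared failed
    // (mapred.map.max.attempts / mapred.reduce.max.attempts, default 4).
    conf.setMaxMapAttempts(4);
    conf.setMaxReduceAttempts(4);

    // Optionally tolerate a percentage of map tasks failing all their
    // attempts without failing the whole job (default 0).
    conf.setMaxMapTaskFailuresPercent(0);
  }
}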

In short, here is what happened with that task.
2011-01-28 13:45:03,741   was the job initialization time
....
Those four attempts were scheduled to the following trackers:
2011-01-28 19:46:15,153
tracker_ip-10-116-247-252.ec2.internal:localhost/127.0.0.1:54196
2011-01-29 00:46:51,376
tracker_ip-10-116-186-223.ec2.internal:localhost/127.0.0.1:37193
2011-01-29 05:47:55,312
tracker_ip-10-114-209-180.ec2.internal:localhost/127.0.0.1:52557
2011-01-29 10:49:35,833
tracker_ip-10-100-205-180.ec2.internal:localhost/127.0.0.1:45500
The tracker_ip-10-100-205-180.ec2.internal instance (where the last two
attempts were scheduled) died in such a way that I couldn't connect to it
with ssh anymore.

The long interval between retries is due to an unusually large timeout we
set. (We are running some statistical algorithms on a relatively large split
size of biological data and the running time for a task is heavily dependent
on the content; that's why we chose to do some measurements to estimate the
optimal split size and running time distribution. We noticed that 30 minutes
wasn't enough for some tasks, therefore we increased the timeout to a value
that all the tasks would fit in, deciding later how to reduce the split size
so that the running time is efficient.)
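
(As a hedged illustration, not taken from the thread: the ~18000 second
timeout visible in the logs below corresponds to mapred.task.timeout, which
is set in milliseconds and defaults to 600000, i.e. 10 minutes. Something
like the following hypothetical sketch would produce it:)

import org.apache.hadoop.mapred.JobConf;

// Hypothetical sketch: raising the task timeout to roughly 5 hours.
// Reporting progress from the task and keeping the default 600000 ms
// (see the mapper sketch earlier in the thread) is usually the safer option.
public class TimeoutTuning {
  public static void configure(JobConf conf) {
    conf.setLong("mapred.task.timeout", 18000000L); // ~18000 s, matching the logs
  }
}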

After the failed node was marked dead, I reran the job with the same data
and the job finished in approx. 2 hours.
(The algorithm is repeatable; it does not have randomness that would produce
different running times when rerunning.)
It looks like before the node died we had a severe slowdown. The real
cause of the slowdown may be the EC2 instance which finally died;
unfortunately we didn't monitor in enough detail to catch such a slowdown.
Is it possible for such a slowdown on one node to have such a huge impact on
the entire job running on 8 workers?

Looking into the tasktracker log for the first attempt of the failed task,
searching for the cause of the failure, I see two different things.
One of them is at the beginning:
----- hadoop-hadoop-tasktracker-ip-10-116-247-252.log.2011-01-28  -----
2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.TaskTracker:
LaunchTaskAction (registerTask): attempt_201101261418_0021_m_000193_0 task's
state:UNASSIGNED
2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.TaskTracker: Trying to
launch : attempt_201101261418_0021_m_000193_0 which needs 1 slots
2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.TaskTracker: In
TaskLauncher, current free slots : 1 and trying to launch
attempt_201101261418_0021_m_000193_0 which needs 1 slots
2011-01-28 19:46:15,153 INFO
org.apache.hadoop.mapreduce.server.tasktracker.Localizer: User-directories
for the user ec2-user are already initialized on this TT. Not doing
anything.
2011-01-28 19:46:15,169 INFO org.apache.hadoop.mapred.JvmManager: Killing
JVM: jvm_201101261418_0021_m_456866840
  (Note that I see the same JVM-killing message several times with
different attempts.)
Then at the timeout:
----- hadoop-hadoop-tasktracker-ip-10-116-247-252.log.2011-01-28  -----
2011-01-29 00:46:31,011 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201101261418_0021_m_000193_0: Task
attempt_201101261418_0021_m_000193_0 failed to report status for 18002
seconds. Killing!
2011-01-29 00:46:31,039 INFO org.apache.hadoop.mapred.TaskTracker: Process
Thread Dump: lost task
43 active threads..... (here I have a long stacktrace, see below)

Very interesting is the "User-directories for the user ec2-user are
already initialized on this TT. Not doing anything." message. After doing
nothing, a timeout perhaps makes some sense, even if it happened at a 5 hour
limit.

Another thing is that an attempt may happen to be rescheduled a 4th
time onto a failing node. In that case the job failover mechanism cannot be
effective anymore. At least I know that the job failover mechanism itself is
not buggy; the node failure just interfered with another problem, which may
be marked by "User-directories for the user ec2-user are already initialized
on this TT. Not doing anything."
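
(One related knob, mentioned here only as a hedged pointer: MRv1 stops
assigning a job's tasks to a TaskTracker once mapred.max.tracker.failures
task failures, 4 by default, have been charged to that tracker for the job,
which limits how often attempts keep landing on the same bad node. A
hypothetical old-API sketch:)

import org.apache.hadoop.mapred.JobConf;

// Hypothetical sketch: lower the per-job failure count after which a
// TaskTracker is blacklisted for this job (mapred.max.tracker.failures).
public class TrackerBlacklistTuning {
  public static void configure(JobConf conf) {
    conf.setMaxTaskFailuresPerTracker(2); // default is 4
  }
}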


Here are the log extracts in detail:
------------------jobtracker log----------------------------
2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.JobInProgress:
Choosing a non-local task task_201101261418_0021_m_000193
2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.JobTracker: Adding
task (MAP) 'attempt_201101261418_0021_m_000193_0' to tip
task_201101261418_0021_m_000193, for tracker
'tracker_ip-10-116-247-252.ec2.internal:localhost/127.0.0.1:54196'

----- hadoop-hadoop-tasktracker-ip-10-116-247-252.log.2011-01-28  -----
2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.TaskTracker:
LaunchTaskAction (registerTask): attempt_201101261418_0021_m_000193_0 task's
state:UNASSIGNED
2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.TaskTracker: Trying to
launch : attempt_201101261418_0021_m_000193_0 which needs 1 slots
2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.TaskTracker: In
TaskLauncher, current free slots : 1 and trying to launch
attempt_201101261418_0021_m_000193_0 which needs 1 slots
2011-01-28 19:46:15,153 INFO
org.apache.hadoop.mapreduce.server.tasktracker.Localizer: User-directories
for the user ec2-user are already initialized on this TT. Not doing
anything.
2011-01-28 19:46:15,169 INFO org.apache.hadoop.mapred.JvmManager: Killing
JVM: jvm_201101261418_0021_m_456866840
  (Note that I see the same JVM-killing message several times with
different attempts.)


------------------jobtracker log----------------------------
2011-01-29 00:46:36,306 INFO org.apache.hadoop.mapred.TaskInProgress: Error
from attempt_201101261418_0021_m_000193_0: Task
attempt_201101261418_0021_m_000193_0 failed to report status for 18002 seconds. Killing!
2011-01-29 00:46:36,306 INFO org.apache.hadoop.mapred.JobTracker: Removing
task 'attempt_201101261418_0021_m_000193_0'
2011-01-29 00:46:36,306 INFO org.apache.hadoop.mapred.TaskInProgress: Error
from attempt_201101261418_0021_m_000193_0: Task
attempt_201101261418_0021_m_000193_0 failed to report status for 18002 seconds. Killing!
2011-01-29 00:46:36,306 INFO org.apache.hadoop.mapred.JobTracker: Removing
task 'attempt_201101261418_0021_m_000193_0'
2011-01-29 00:46:36,307 INFO org.apache.hadoop.mapred.JobTracker: Adding
task (TASK_CLEANUP) 'attempt_201101261418_0021_m_000193_0' to tip
task_201101261418_0021_m_000193, for tracker
'tracker_ip-10-116-247-252.ec2.internal:localhost/127.0.0.1:54196'
2011-01-29 00:46:51,376 INFO org.apache.hadoop.mapred.JobInProgress:
Choosing a non-local task task_201101261418_0021_m_000193
2011-01-29 00:46:51,376 INFO org.apache.hadoop.mapred.JobTracker: Adding
task (MAP) 'attempt_201101261418_0021_m_000193_1' to tip
task_201101261418_0021_m_000193, for tracker
'tracker_ip-10-116-186-223.ec2.internal:localhost/127.0.0.1:37193'

----- hadoop-hadoop-tasktracker-ip-10-116-247-252.log.2011-01-28  -----
2011-01-29 00:46:31,011 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201101261418_0021_m_000193_0: Task
attempt_201101261418_0021_m_000193_0 failed to report status for 18002
seconds. Killing!
2011-01-29 00:46:31,039 INFO org.apache.hadoop.mapred.TaskTracker: Process
Thread Dump: lost task
43 active threads
Thread 17976 (process reaper):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
    java.lang.UNIXProcess.waitForProcessExit(Native Method)
    java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
    java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
Thread 17975 (JVM Runner jvm_201101261418_0021_m_-646088387 spawned.):
  State: WAITING
  Blocked count: 1
  Waited count: 2
  Waiting on java.lang.UNIXProcess@19570aa2
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
    org.apache.hadoop.util.Shell.runCommand(Shell.java:245)
    org.apache.hadoop.util.Shell.run(Shell.java:177)

org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:376)

org.apache.hadoop.mapred.DefaultTaskController.launchTaskJVM(DefaultTaskController.java:70)

org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:450)

org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:439)
Thread 17947 (Thread-9955):
  State: WAITING
  Blocked count: 4
  Waited count: 6
  Waiting on java.lang.Object@217b1ead
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)

org.apache.hadoop.mapred.TaskRunner.launchJvmAndWait(TaskRunner.java:260)
    org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:208)
Thread 17924 (process reaper):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
    java.lang.UNIXProcess.waitForProcessExit(Native Method)
    java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
    java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
Thread 17922 (JVM Runner jvm_201101261418_0021_m_913485741 spawned.):
  State: WAITING
  Blocked count: 1
  Waited count: 2
  Waiting on java.lang.UNIXProcess@118b3420
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
    org.apache.hadoop.util.Shell.runCommand(Shell.java:245)
    org.apache.hadoop.util.Shell.run(Shell.java:177)

org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:376)

org.apache.hadoop.mapred.DefaultTaskController.launchTaskJVM(DefaultTaskController.java:70)

org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:450)

org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:439)
Thread 17906 (Thread-9934):
  State: WAITING
  Blocked count: 1
  Waited count: 1
  Waiting on java.lang.Object@4a0b29df
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)

org.apache.hadoop.mapred.TaskRunner.launchJvmAndWait(TaskRunner.java:260)
    org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:208)
Thread 17889 (process reaper):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
    java.lang.UNIXProcess.waitForProcessExit(Native Method)
    java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
    java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
Thread 17888 (JVM Runner jvm_201101261418_0021_m_-2031309830 spawned.):
  State: WAITING
  Blocked count: 1
  Waited count: 2
  Waiting on java.lang.UNIXProcess@762b9c2d
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
    org.apache.hadoop.util.Shell.runCommand(Shell.java:245)
    org.apache.hadoop.util.Shell.run(Shell.java:177)

org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:376)

org.apache.hadoop.mapred.DefaultTaskController.launchTaskJVM(DefaultTaskController.java:70)

org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:450)

org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:439)
Thread 17871 (Thread-9914):
  State: WAITING
  Blocked count: 1
  Waited count: 1
  Waiting on java.lang.Object@291a8af7
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)

org.apache.hadoop.mapred.TaskRunner.launchJvmAndWait(TaskRunner.java:260)
    org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:208)
Thread 17283 (process reaper):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
    java.lang.UNIXProcess.waitForProcessExit(Native Method)
    java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
    java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
Thread 17282 (JVM Runner jvm_201101261418_0021_m_811927398 spawned.):
  State: WAITING
  Blocked count: 1
  Waited count: 2
  Waiting on java.lang.UNIXProcess@5ff6c2a3
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
    org.apache.hadoop.util.Shell.runCommand(Shell.java:245)
    org.apache.hadoop.util.Shell.run(Shell.java:177)

org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:376)

org.apache.hadoop.mapred.DefaultTaskController.launchTaskJVM(DefaultTaskController.java:70)

org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:450)

org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:439)
Thread 17248 (Thread-9561):
  State: WAITING
  Blocked count: 1
  Waited count: 1
  Waiting on java.lang.Object@41648c8d
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)

org.apache.hadoop.mapred.TaskRunner.launchJvmAndWait(TaskRunner.java:260)
    org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:208)
Thread 16493 (process reaper):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
    java.lang.UNIXProcess.waitForProcessExit(Native Method)
    java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
    java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
Thread 16492 (JVM Runner jvm_201101261418_0021_r_1210798394 spawned.):
  State: WAITING
  Blocked count: 1
  Waited count: 2
  Waiting on java.lang.UNIXProcess@7994641d
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
    org.apache.hadoop.util.Shell.runCommand(Shell.java:245)
    org.apache.hadoop.util.Shell.run(Shell.java:177)

org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:376)

org.apache.hadoop.mapred.DefaultTaskController.launchTaskJVM(DefaultTaskController.java:70)

org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:450)

org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:439)
Thread 16491 (Thread-9136):
  State: WAITING
  Blocked count: 0
  Waited count: 1
  Waiting on java.lang.Object@7d38ea46
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)

org.apache.hadoop.mapred.TaskRunner.launchJvmAndWait(TaskRunner.java:260)
    org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:208)
Thread 7890 (sendParams-3):
  State: TIMED_WAITING
  Blocked count: 14
  Waited count: 50144
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)

java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:424)

java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:323)
    java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:874)

java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:945)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
    java.lang.Thread.run(Thread.java:619)
Thread 346 (IPC Client (47) connection to
ec2-184-73-96-34.compute-1.amazonaws.com/10.195.194.223:8021 from mapred):
  State: TIMED_WAITING
  Blocked count: 128340
  Waited count: 127868
  Stack:
    java.lang.Object.wait(Native Method)
    org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:645)
    org.apache.hadoop.ipc.Client$Connection.run(Client.java:688)
Thread 10 (taskCleanup):
  State: WAITING
  Blocked count: 24
  Waited count: 32
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@4d480ea
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
    org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:397)
    java.lang.Thread.run(Thread.java:619)
Thread 16 (Thread-7):
  State: WAITING
  Blocked count: 2352
  Waited count: 2782
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1b5b8520
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)

org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.monitor(UserLogManager.java:79)

org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager$1.run(UserLogManager.java:46)
Thread 17 (Thread-8):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 58
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.mapred.UserLogCleaner.run(UserLogCleaner.java:92)
Thread 37 (TaskLauncher for REDUCE tasks):
  State: WAITING
  Blocked count: 23
  Waited count: 36
  Waiting on java.util.LinkedList@158105e8
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)

org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2119)
Thread 36 (TaskLauncher for MAP tasks):
  State: WAITING
  Blocked count: 502
  Waited count: 555
  Waiting on java.util.LinkedList@49aacd5f
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)

org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2119)
Thread 33 (Map-events fetcher for all reduce tasks on
tracker_ip-10-116-247-252.ec2.internal:localhost/127.0.0.1:54196):
  State: TIMED_WAITING
  Blocked count: 316394
  Waited count: 377686
  Stack:
    java.lang.Object.wait(Native Method)

org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread.run(TaskTracker.java:859)
Thread 30 (IPC Server handler 7 on 54196):
  State: WAITING
  Blocked count: 1140
  Waited count: 77439
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
    org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
Thread 29 (IPC Server handler 6 on 54196):
  State: WAITING
  Blocked count: 1134
  Waited count: 77434
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
    org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
Thread 28 (IPC Server handler 5 on 54196):
  State: WAITING
  Blocked count: 1165
  Waited count: 77436
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
    org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
Thread 27 (IPC Server handler 4 on 54196):
  State: WAITING
  Blocked count: 1131
  Waited count: 77435
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
    org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
Thread 26 (IPC Server handler 3 on 54196):
  State: WAITING
  Blocked count: 1015
  Waited count: 77443
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
    org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
Thread 25 (IPC Server handler 2 on 54196):
  State: WAITING
  Blocked count: 1027
  Waited count: 77437
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
    org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
Thread 24 (IPC Server handler 1 on 54196):
  State: WAITING
  Blocked count: 1083
  Waited count: 77444
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
    org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
Thread 23 (IPC Server handler 0 on 54196):
  State: WAITING
  Blocked count: 1018
  Waited count: 77435
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
    org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
Thread 20 (IPC Server listener on 54196):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 1
  Stack:
    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
    org.apache.hadoop.ipc.Server$Listener.run(Server.java:379)
Thread 22 (IPC Server Responder):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    org.apache.hadoop.ipc.Server$Responder.run(Server.java:539)
Thread 21 (Timer thread for monitoring rpc):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 41380
  Stack:
    java.lang.Object.wait(Native Method)
    java.util.TimerThread.mainLoop(Timer.java:509)
    java.util.TimerThread.run(Timer.java:462)
Thread 19 (Timer thread for monitoring jvm):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 41380
  Stack:
    java.lang.Object.wait(Native Method)
    java.util.TimerThread.mainLoop(Timer.java:509)
    java.util.TimerThread.run(Timer.java:462)
Thread 18 (Directory/File cleanup thread):
  State: WAITING
  Blocked count: 1593
  Waited count: 2038
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@44aea710
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)

org.apache.hadoop.mapred.CleanupQueue$PathCleanupThread.run(CleanupQueue.java:130)
Thread 15 (Timer-0):
  State: TIMED_WAITING
  Blocked count: 1
  Waited count: 6898
  Stack:
    java.lang.Object.wait(Native Method)
    java.util.TimerThread.mainLoop(Timer.java:509)
    java.util.TimerThread.run(Timer.java:462)
Thread 14 (1720982966@qtp0-0 - Acceptor0
SelectChannelConnector@0.0.0.0:50060):
  State: RUNNABLE
  Blocked count: 537
  Waited count: 43
  Stack:
    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)

org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:429)
    org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:185)

org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)

org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:707)

org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
Thread 12 (Timer thread for monitoring mapred):
  State: TIMED_WAITING
  Blocked count: 1
  Waited count: 41382
  Stack:
    java.lang.Object.wait(Native Method)
    java.util.TimerThread.mainLoop(Timer.java:509)
    java.util.TimerThread.run(Timer.java:462)
Thread 4 (Signal Dispatcher):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
Thread 3 (Finalizer):
  State: WAITING
  Blocked count: 1235
  Waited count: 1226
  Waiting on java.lang.ref.ReferenceQueue$Lock@309fe84e
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
    java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
Thread 2 (Reference Handler):
  State: WAITING
  Blocked count: 1344
  Waited count: 1340
  Waiting on java.lang.ref.Reference$Lock@524c71d2
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
Thread 1 (main):
  State: RUNNABLE
  Blocked count: 68838
  Waited count: 205783
  Stack:
    sun.management.ThreadImpl.getThreadInfo0(Native Method)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)

org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:149)

org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:203)

org.apache.hadoop.mapred.TaskTracker.markUnresponsiveTasks(TaskTracker.java:1771)
    org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1538)
    org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:2280)
    org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3504)

2011-01-29 00:46:31,039 INFO org.apache.hadoop.mapred.TaskTracker: About to
purge task: attempt_201101261418_0021_m_000193_0

----------------------------------------------
2011-01-29 05:47:09,730 INFO org.apache.hadoop.mapred.TaskInProgress: Error
from attempt_201101261418_0021_m_000193_1: Task
attempt_201101261418_0021_m_000193_1 failed to report status for 18001 seconds. Killing!
2011-01-29 05:47:09,731 INFO org.apache.hadoop.mapred.JobTracker: Removing
task 'attempt_201101261418_0021_m_000193_1'
2011-01-29 05:47:09,731 INFO org.apache.hadoop.mapred.JobTracker: Adding
task (TASK_CLEANUP) 'attempt_201101261418_0021_m_000193_1' to tip
task_201101261418_0021_m_000193, for tracker
'tracker_ip-10-116-186-223.ec2.internal:localhost/127.0.0.1:37193'
2011-01-29 05:47:12,891 INFO org.apache.hadoop.mapred.JobInProgress:
Choosing a non-local task task_201101261418_0021_m_000544
2011-01-29 05:47:12,891 INFO org.apache.hadoop.mapred.JobTracker: Adding
task (MAP) 'attempt_201101261418_0021_m_000544_0' to tip
task_201101261418_0021_m_000544, for tracker
'tracker_ip-10-116-186-223.ec2.internal:localhost/127.0.0.1:37193'
2011-01-29 05:47:12,891 INFO org.apache.hadoop.mapred.JobTracker: Removing
task 'attempt_201101261418_0021_m_000193_1'
2011-01-29 05:47:55,312 INFO org.apache.hadoop.mapred.JobInProgress: Task
'attempt_201101261418_0021_m_000517_0' has completed
task_201101261418_0021_m_000517 successfully.
2011-01-29 05:47:55,312 INFO org.apache.hadoop.mapred.JobInProgress:
Choosing a non-local task task_201101261418_0021_m_000193
2011-01-29 05:47:55,312 INFO org.apache.hadoop.mapred.JobTracker: Adding
task (MAP) 'attempt_201101261418_0021_m_000193_2' to tip
task_201101261418_0021_m_000193, for tracker
'tracker_ip-10-114-209-180.ec2.internal:localhost/127.0.0.1:52557'
  (Note that I do not have the corresponding tasktracker log from
tracker_ip-10-116-186-223.ec2.internal.)

----------------------------------------------
2011-01-29 10:48:12,869 INFO org.apache.hadoop.mapred.TaskInProgress: Error
from attempt_201101261418_0021_m_000193_2: Task
attempt_201101261418_0021_m_000193_2 failed to report status for 18000 seconds. Killing!
2011-01-29 10:48:12,869 INFO org.apache.hadoop.mapred.JobTracker: Removing
task 'attempt_201101261418_0021_m_000193_2'
2011-01-29 10:48:12,869 INFO org.apache.hadoop.mapred.JobTracker: Adding
task (TASK_CLEANUP) 'attempt_201101261418_0021_m_000193_2' to tip
task_201101261418_0021_m_000193, for tracker
'tracker_ip-10-114-209-180.ec2.internal:localhost/127.0.0.1:52557'
2011-01-29 10:48:15,995 INFO org.apache.hadoop.mapred.JobInProgress:
Choosing a non-local task task_201101261418_0021_m_000722
2011-01-29 10:48:15,996 INFO org.apache.hadoop.mapred.JobTracker: Adding
task (MAP) 'attempt_201101261418_0021_m_000722_0' to tip
task_201101261418_0021_m_000722, for tracker
'tracker_ip-10-114-209-180.ec2.internal:localhost/127.0.0.1:52557'
2011-01-29 10:48:15,996 INFO org.apache.hadoop.mapred.JobTracker: Removing
task 'attempt_201101261418_0021_m_000193_2'
2011-01-29 10:49:35,832 INFO org.apache.hadoop.mapred.JobInProgress: Task
'attempt_201101261418_0021_m_000697_0' has completed
task_201101261418_0021_m_000697 successfully.
2011-01-29 10:49:35,833 INFO org.apache.hadoop.mapred.JobInProgress:
Choosing a non-local task task_201101261418_0021_m_000193
2011-01-29 10:49:35,833 INFO org.apache.hadoop.mapred.JobTracker: Adding
task (MAP) 'attempt_201101261418_0021_m_000193_3' to tip
task_201101261418_0021_m_000193, for tracker
'tracker_ip-10-100-205-180.ec2.internal:localhost/127.0.0.1:45500'

----------------------------------------------
2011-01-29 15:49:56,006 INFO org.apache.hadoop.mapred.TaskInProgress: Error
from attempt_201101261418_0021_m_000193_3: Task
attempt_201101261418_0021_m_000193_3 failed to report status for 18002 seconds. Killing!
2011-01-29 15:49:56,006 INFO org.apache.hadoop.mapred.JobTracker: Removing
task 'attempt_201101261418_0021_m_000193_3'
2011-01-29 15:49:56,007 INFO org.apache.hadoop.mapred.JobTracker: Adding
task (TASK_CLEANUP) 'attempt_201101261418_0021_m_000193_3' to tip
task_201101261418_0021_m_000193, for tracker
'tracker_ip-10-100-205-180.ec2.internal:localhost/127.0.0.1:45500'
2011-01-29 15:49:59,070 INFO org.apache.hadoop.mapred.TaskInProgress:
TaskInProgress task_201101261418_0021_m_000193 has failed 4 times.
2011-01-29 15:49:59,070 INFO org.apache.hadoop.mapred.JobInProgress:
Aborting job job_201101261418_0021
2011-01-29 15:49:59,070 INFO org.apache.hadoop.mapred.JobInProgress: Killing
job 'job_201101261418_0021'
2011-01-29 15:49:59,072 INFO org.apache.hadoop.mapred.JobTracker: Adding
task (JOB_CLEANUP) 'attempt_201101261418_0021_m_001294_0' to tip
task_201101261418_0021_m_001294, for tracker
'tracker_ip-10-100-205-180.ec2.internal:localhost/127.0.0.1:45500'
2011-01-29 15:49:59,072 INFO org.apache.hadoop.mapred.JobTracker: Removing
task 'attempt_201101261418_0021_m_000193_3'
2011-01-29 15:50:04,244 INFO org.apache.hadoop.mapred.JobTracker: Removing
task 'attempt_201101261418_0021_m_000876_0'
2011-01-29 15:50:04,900 INFO org.apache.hadoop.mapred.JobTracker: Removing
task 'attempt_201101261418_0021_m_000781_0'
2011-01-29 15:50:05,043 INFO org.apache.hadoop.mapred.JobTracker: Removing
task 'attempt_201101261418_0021_m_000877_0'
2011-01-29 15:50:05,132 INFO org.apache.hadoop.mapred.JobTracker: Removing
task 'attempt_201101261418_0021_m_000865_0'
2011-01-29 15:50:05,736 INFO org.apache.hadoop.mapred.JobTracker: Removing
task 'attempt_201101261418_0021_m_000875_0'
2011-01-29 15:50:05,768 INFO org.apache.hadoop.mapred.JobTracker: Removing
task 'attempt_201101261418_0021_m_000864_0'
2011-01-29 15:50:06,334 INFO org.apache.hadoop.mapred.JobTracker: Removing
task 'attempt_201101261418_0021_m_000866_0'
.......

That is the end of the log details I have.
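
For anyone reading this later: if I read the numbers correctly, the "failed to
report status for 18001 seconds" lines mean mapred.task.timeout was raised to
roughly 18000000 ms on this cluster (the default is 600000 ms, i.e. 10 minutes),
and the job was aborted once task_201101261418_0021_m_000193 reached
mapred.map.max.attempts (4 by default). A safer pattern for long-running map
tasks is to keep the timeout near the default and report progress from inside
the task. A minimal sketch of that idea follows; the class name
LongRunningMapper and the helper expensiveStep are only placeholders for
illustration, not our real job code:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative sketch only: a mapper that keeps telling the TaskTracker it is
// alive while a single input record takes a long time to process.
public class LongRunningMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (int step = 0; step < 100; step++) {
      expensiveStep(value, step);   // placeholder for the real per-record work
      context.progress();           // resets the mapred.task.timeout clock
      context.setStatus("record " + key.get() + ", step " + step);
    }
    context.write(value, new Text("done"));
  }

  private void expensiveStep(Text value, int step) {
    // placeholder; the real job would do its long-running computation here
  }
}

With the old org.apache.hadoop.mapred API the equivalent call is
reporter.progress(), and the timeout can also be set per job, e.g.
conf.setLong("mapred.task.timeout", 600000L), instead of cluster-wide.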

Thank you for your attention.

Tibor

On Tue, Feb 1, 2011 at 5:33 PM, Koji Noguchi <kn...@yahoo-inc.com> wrote:

>  (Bcc CDH alias)
>
>
> > Please don't cross-post, CDH questions should go to their user lists.
> >
> Was this CDH specific?
>
> Did the job show up as failed on the jobtracker webui?
> If yes, can you grep a jobtracker log to see something like
>
> 2011-02-01 00:06:41,510 INFO org.apache.hadoop.mapred.TaskInProgress:
> TaskInProgress task_201101040441_333049_r_000004 has failed 4 times.
> 2011-02-01 00:06:41,510 INFO org.apache.hadoop.mapred.JobInProgress:
> Aborting job job_201101040441_333049
> 2011-02-01 00:06:41,510 INFO org.apache.hadoop.mapred.JobInProgress:
> Killing job 'job_201101040441_333049'
>
> which tells you which task failure caused the job to fail.
> Then you can look at the userlog of those task attempts to see why they
> failed.
>
> Ideally this info should show up on the webui.
>
> On the other hand, if the job just hangs for hours, there’s probably a bug
> in the framework.
>
> Koji
>
>
> On 1/31/11 9:36 PM, "Arun C Murthy" <ac...@yahoo-inc.com> wrote:
>
> Please don't cross-post, CDH questions should go to their user lists.
>
> On Jan 31, 2011, at 6:15 AM, Kiss Tibor wrote:
>
> Hi!
>
> I was running a Hadoop cluster on Amazon EC2 instances, then after 2 days
> of work, one of the worker nodes just simply died (I cannot connect to the
> instance either). That node also appears on the dfshealth as dead node.
>  Until now everything is normal.
>
> Unfortunately the job it was running didn't survived. The cluster it had 8
> worker nodes, each with 4 mappers and 2 reducers. The job in cause it had
> ~1200 map tasks and 10 reduce tasks.
>  One of the node died and I see around 31 failed attempts in the jobtracker
> log.  The log is very similar with the one  somebody placed it here:
> http://pastie.org/pastes/1270614
>
> Some of the attempts (but not all!) has been retried and I saw at least two
> of them which finally is getting in a successful state.
> The following two lines appears several times in my jobtracker log:
> 2011-01-29 15:50:34,956 WARN org.apache.hadoop.mapred.JobInProgress:
> Running list for reducers missing!! Job details are missing.
>  2011-01-29 15:50:34,956 WARN org.apache.hadoop.mapred.JobInProgress:
> Failed cache for reducers missing!! Job details are missing.
>
> These pair of log lines could be the signals that the job couldn't be
> finished by re-scheduling the failed attempts.
>  Nothing special I have seen in namenode logs.
>
> Of course I rerun the failed job which finished successfully. But my
> problem is that I would like to understand the failover conditions. What
> could be lost, which part of the hadoop is not fault tolerant in this sense
> that it happens to see those warnings mentioned earlier. Is there a chance
> to control such kind of situations?
>
> I am using CDH3b3 version, so it is a developing version of Hadoop.
> Somebody knows about a special bug or fix which in the near future can
> solve the problem?
>
> Regards
> Tibor Kiss
>
>
>
>
>
>

Re: What are the conditions or which is the status of re-scheduling feature for failed attempts caused by dying a node?

Posted by Koji Noguchi <kn...@yahoo-inc.com>.
(Bcc CDH alias)

> Please don't cross-post, CDH questions should go to their user lists.
>
Was this CDH specific?

Did the job show up as failed on the jobtracker webui?
If yes, can you grep a jobtracker log to see something like

2011-02-01 00:06:41,510 INFO org.apache.hadoop.mapred.TaskInProgress: TaskInProgress task_201101040441_333049_r_000004 has failed 4 times.
2011-02-01 00:06:41,510 INFO org.apache.hadoop.mapred.JobInProgress: Aborting job job_201101040441_333049
2011-02-01 00:06:41,510 INFO org.apache.hadoop.mapred.JobInProgress: Killing job 'job_201101040441_333049'

which tells you which task failure caused the job to fail.
Then you can look at the userlog of those task attempts to see why they failed.

Ideally this info should show up on the webui.

On the other hand, if the job just hangs for hours, there's probably a bug in the framework.

Koji

On 1/31/11 9:36 PM, "Arun C Murthy" <ac...@yahoo-inc.com> wrote:

Please don't cross-post, CDH questions should go to their user lists.

On Jan 31, 2011, at 6:15 AM, Kiss Tibor wrote:

Hi!

I was running a Hadoop cluster on Amazon EC2 instances, then after 2 days of work, one of the worker nodes just simply died (I cannot connect to the instance either). That node also appears on the dfshealth as dead node.
 Until now everything is normal.

Unfortunately the job it was running didn't survived. The cluster it had 8 worker nodes, each with 4 mappers and 2 reducers. The job in cause it had ~1200 map tasks and 10 reduce tasks.
 One of the node died and I see around 31 failed attempts in the jobtracker log.  The log is very similar with the one  somebody placed it here: http://pastie.org/pastes/1270614

Some of the attempts (but not all!) has been retried and I saw at least two of them which finally is getting in a successful state.
The following two lines appears several times in my jobtracker log:
2011-01-29 15:50:34,956 WARN org.apache.hadoop.mapred.JobInProgress: Running list for reducers missing!! Job details are missing.
 2011-01-29 15:50:34,956 WARN org.apache.hadoop.mapred.JobInProgress: Failed cache for reducers missing!! Job details are missing.

These pair of log lines could be the signals that the job couldn't be finished by re-scheduling the failed attempts.
 Nothing special I have seen in namenode logs.

Of course I rerun the failed job which finished successfully. But my problem is that I would like to understand the failover conditions. What could be lost, which part of the hadoop is not fault tolerant in this sense that it happens to see those warnings mentioned earlier. Is there a chance to control such kind of situations?

I am using CDH3b3 version, so it is a developing version of Hadoop.
Somebody knows about a special bug or fix which in the near future can solve the problem?

Regards
Tibor Kiss