Posted to common-user@hadoop.apache.org by Pratyush Banerjee <pr...@aol.com> on 2008/07/17 09:42:04 UTC
Map reduce jobs started failing suddenly.....
Hi All,
We have been using hadoop-0.17.1 on a 50-machine cluster. The
cluster is configured as follows:
m/c 1: hadoop-d01 : NameNode / Secondary NameNode / JobTracker
rest: hadoop-d02 to hadoop-d50 : DataNodes and TaskTrackers
We have been running map/reduce jobs on the above cluster
successfully for some time, but suddenly all map/reduce jobs have started
failing. After our own custom map/reduce jobs failed, we even tried
executing the wordcount example which comes with the Hadoop examples.
Even that failed.
The logs don't seem to reveal much in this context. Here are the excerpts
from the logs which appeared relevant to me.
NAMENODE: JOBTRACKER LOG
2008-07-17 00:30:00,856 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000044
2008-07-17 00:30:00,856 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000044_0' to tip tip_200807090133_0048_m_000044, for tracker 'tracker_hadoop-d34.search.aol.com:localhost.localdomain/127.0.0.1:2375'
2008-07-17 00:30:00,877 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000045
2008-07-17 00:30:00,877 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000045_0' to tip tip_200807090133_0048_m_000045, for tracker 'tracker_hadoop-d33.search.aol.com:localhost.localdomain/127.0.0.1:3484'
2008-07-17 00:30:00,890 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000046
2008-07-17 00:30:00,890 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000046_0' to tip tip_200807090133_0048_m_000046, for tracker 'tracker_hadoop-d10.search.aol.com:localhost.localdomain/127.0.0.1:7236'
2008-07-17 00:30:00,903 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000047
2008-07-17 00:30:00,904 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000047_0' to tip tip_200807090133_0048_m_000047, for tracker 'tracker_hadoop-d11.search.aol.com:localhost.localdomain/127.0.0.1:64400'
2008-07-17 00:30:00,909 INFO org.apache.hadoop.mapred.JobInProgress: Choosing data-local task tip_200807090133_0048_m_000048
2008-07-17 00:30:00,909 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000048_0' to tip tip_200807090133_0048_m_000048, for tracker 'tracker_hadoop-d12.search.aol.com:localhost.localdomain/127.0.0.1:6953'
2008-07-17 00:30:00,923 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000049
2008-07-17 00:30:00,923 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000049_0' to tip tip_200807090133_0048_m_000049, for tracker 'tracker_hadoop-d20.search.aol.com:localhost.localdomain/127.0.0.1:31257'
2008-07-17 00:30:00,934 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807090133_0048_m_000000_0: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)
2008-07-17 00:30:00,935 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_r_000000_0' to tip tip_200807090133_0048_r_000000, for tracker 'tracker_hadoop-d17.search.aol.com:localhost.localdomain/127.0.0.1:6179'
2008-07-17 00:30:01,232 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807090133_0048_m_000003_0: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)
2008-07-17 00:30:01,475 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807090133_0048_r_000000_0: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)
2008-07-17 00:30:01,476 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200807090133_0048_m_000000_0' from 'tracker_hadoop-d17.search.aol.com:localhost.localdomain/127.0.0.1:6179'
2008-07-17 00:30:01,516 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000000
2008-07-17 00:30:01,516 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000000_1' to tip tip_200807090133_0048_m_000000, for tracker 'tracker_hadoop-d42.search.aol.com:localhost.localdomain/127.0.0.1:1914'
2008-07-17 00:30:01,619 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807090133_0048_m_000008_0: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)
2008-07-17 00:30:01,673 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807090133_0048_m_000001_0: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)
2008-07-17 00:30:01,673 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807090133_0048_m_000002_0: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)
When we examined the failed jobs from the web UI, every failed job listed
about 13 blacklisted nodes. The NON-blacklisted nodes, compiled across
the different failed jobs, are:
hadoop-d01, hadoop-d10
hadoop-d12, hadoop-d15, hadoop-d16, hadoop-d18, hadoop-d19
hadoop-d22, hadoop-d23, hadoop-d24, hadoop-d27, hadoop-d28
hadoop-d33
hadoop-d41, hadoop-d45
hadoop-d50
For all the other nodes, which have been blacklisted by one failing job
or another, no entries are present in the logs.
For the remaining live machines, the following are excerpts from their
logs.
HADOOP-D10: TASKTRACKER LOG
2008-07-17 01:22:12,044 WARN org.apache.hadoop.mapred.TaskTracker: getMapOutput(task_200807090133_0053_m_000085_0,8) failed :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200807090133_0053/task_200807090133_0053_m_000085_0/output/file.out.index in any of the configured local directories
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:359)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
    at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2300)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
    at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
    at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
    at org.mortbay.http.HttpServer.service(HttpServer.java:954)
    at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
    at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
    at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
    at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
    at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
    at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
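The DiskErrorException above means the TaskTracker could not find the map
output file under any of the directories listed in mapred.local.dir. In case
it helps, here is a rough sanity check we could run on each node; the
directory path in the final line is an assumption, to be replaced with the
actual mapred.local.dir entries from hadoop-site.xml:

```shell
#!/bin/sh
# Check that each local mapred directory exists, is writable, and report how
# full the filesystem holding it is. Pass the mapred.local.dir entries as
# arguments (the path used in the example call below is an assumption).
check_local_dirs() {
    for d in "$@"; do
        if [ ! -d "$d" ]; then
            echo "MISSING: $d"
        elif [ ! -w "$d" ]; then
            echo "NOT WRITABLE: $d"
        else
            # df -P guarantees one line per filesystem; field 5 is "NN%" used.
            used=$(df -P "$d" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
            echo "OK: $d (${used}% used)"
        fi
    done
}

check_local_dirs /hadoop/mapred/local
```

A missing or full local directory on a node would explain both this servlet
error and that node ending up blacklisted.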
Incidentally, we write log files into the Hadoop DFS continuously
from some of our web servers. We had just added a couple of new
web servers recently (2 days back, to be precise), which causes much more
data to be dumped into HDFS. Strangely, the map/reduce jobs started
failing right after that (might be a coincidence, might not be).
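If the extra data from the new web servers has been filling the disks, that
alone could explain the nonzero task exits and the missing map outputs. A
rough way to survey disk usage across the slaves might be the sketch below;
the conf/slaves path and passwordless ssh are assumptions about the setup:

```shell
#!/bin/sh
# fullest_fs: given `df -P` output on stdin, print "<pct-used> <mountpoint>"
# for the fullest filesystem on that machine.
fullest_fs() {
    awk 'NR > 1 { sub(/%/, "", $5); print $5, $6 }' | sort -rn | head -1
}

# Survey every slave, worst-filled node first (assumes passwordless ssh and
# that conf/slaves lists one hostname per line -- both are assumptions):
# for host in $(cat conf/slaves); do
#     printf '%s ' "$host"; ssh "$host" df -P | fullest_fs
# done | sort -k2 -rn
```

Any node near 100% on the filesystem holding mapred.local.dir or the DFS
data directories would be a prime suspect.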
Can anybody suggest why this is happening, or any remedy thereof?
thanks and regards
Pratyush Banerjee