Posted to common-user@hadoop.apache.org by Pratyush Banerjee <pr...@aol.com> on 2008/07/17 09:42:04 UTC

Map reduce jobs started failing suddenly.....

Hi All,

We have been using hadoop-0.17.1 on a 50-machine cluster. The 
configuration of the cluster is as follows:

m/c 1: hadoop-d01: NameNode / Secondary NameNode / JobTracker
rest: hadoop-d02 to hadoop-d50: DataNodes and TaskTrackers

We have been running map reduce jobs on the above-mentioned cluster 
successfully for some time, but suddenly all map reduce jobs have started 
failing. After our own custom map reduce jobs failed, we even tried 
executing the wordcount example which comes with the Hadoop examples jar. 
Even that failed.
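
For reference, this is roughly how we invoke the bundled wordcount example 
(the input and output paths below are placeholders for our actual HDFS paths, 
and the examples jar name may differ slightly between builds):

# run from the Hadoop install directory on the JobTracker node
bin/hadoop jar hadoop-0.17.1-examples.jar wordcount /path/to/input /path/to/output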

The logs do not seem to reveal much, but here are the excerpts which 
appeared most relevant to me.

NAMENODE / JOBTRACKER LOG:
2008-07-17 00:30:00,856 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000044
2008-07-17 00:30:00,856 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000044_0' to tip tip_200807090133_0048_m_000044, for tracker 'tracker_hadoop-d34.search.aol.com:localhost.localdomain/127.0.0.1:2375'
2008-07-17 00:30:00,877 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000045
2008-07-17 00:30:00,877 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000045_0' to tip tip_200807090133_0048_m_000045, for tracker 'tracker_hadoop-d33.search.aol.com:localhost.localdomain/127.0.0.1:3484'
2008-07-17 00:30:00,890 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000046
2008-07-17 00:30:00,890 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000046_0' to tip tip_200807090133_0048_m_000046, for tracker 'tracker_hadoop-d10.search.aol.com:localhost.localdomain/127.0.0.1:7236'
2008-07-17 00:30:00,903 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000047
2008-07-17 00:30:00,904 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000047_0' to tip tip_200807090133_0048_m_000047, for tracker 'tracker_hadoop-d11.search.aol.com:localhost.localdomain/127.0.0.1:64400'
2008-07-17 00:30:00,909 INFO org.apache.hadoop.mapred.JobInProgress: Choosing data-local task tip_200807090133_0048_m_000048
2008-07-17 00:30:00,909 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000048_0' to tip tip_200807090133_0048_m_000048, for tracker 'tracker_hadoop-d12.search.aol.com:localhost.localdomain/127.0.0.1:6953'
2008-07-17 00:30:00,923 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000049
2008-07-17 00:30:00,923 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000049_0' to tip tip_200807090133_0048_m_000049, for tracker 'tracker_hadoop-d20.search.aol.com:localhost.localdomain/127.0.0.1:31257'
2008-07-17 00:30:00,934 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807090133_0048_m_000000_0: java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)
2008-07-17 00:30:00,935 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_r_000000_0' to tip tip_200807090133_0048_r_000000, for tracker 'tracker_hadoop-d17.search.aol.com:localhost.localdomain/127.0.0.1:6179'
2008-07-17 00:30:01,232 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807090133_0048_m_000003_0: java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)

2008-07-17 00:30:01,475 INFO org.apache.hadoop.mapred.TaskInProgress: 
Error from task_200807090133_0048_r_000000_0: java.io.IOException: Task 
process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)

2008-07-17 00:30:01,476 INFO org.apache.hadoop.mapred.JobTracker: 
Removed completed task 'task_200807090133_0048_m_000000_0' from 
'tracker_hadoop-d17.search.aol.com:localhost.localdomain/127.0.0.1:6179'
2008-07-17 00:30:01,516 INFO org.apache.hadoop.mapred.JobInProgress: 
Choosing rack-local task tip_200807090133_0048_m_000000
2008-07-17 00:30:01,516 INFO org.apache.hadoop.mapred.JobTracker: Adding 
task 'task_200807090133_0048_m_000000_1' to tip 
tip_200807090133_0048_m_000000, for tracker 
'tracker_hadoop-d42.search.aol.com:localhost.localdomain/127.0.0.1:1914'
2008-07-17 00:30:01,619 INFO org.apache.hadoop.mapred.TaskInProgress: 
Error from task_200807090133_0048_m_000008_0: java.io.IOException: Task 
process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)

2008-07-17 00:30:01,673 INFO org.apache.hadoop.mapred.TaskInProgress: 
Error from task_200807090133_0048_m_000001_0: java.io.IOException: Task 
process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)

2008-07-17 00:30:01,673 INFO org.apache.hadoop.mapred.TaskInProgress: 
Error from task_200807090133_0048_m_000002_0: java.io.IOException: Task 
process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)
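
As far as I understand, "Task process exit with nonzero status of 1" only 
means the child JVM died before reporting anything back, so the child's own 
output may say more. A rough sketch of where to look, assuming the default 
log layout (the exact file names under userlogs may differ in 0.17):

# on the TaskTracker node that ran the failed attempt, e.g. hadoop-d17
ls ${HADOOP_HOME}/logs/userlogs/task_200807090133_0048_m_000000_0/
cat ${HADOOP_HOME}/logs/userlogs/task_200807090133_0048_m_000000_0/stderr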

When we examined the failed jobs from the web UI, every failed job listed 
about 13 blacklisted nodes. The NON-blacklisted nodes, compiled from the 
different failed jobs, are:
hadoop-d01, hadoop-d10
hadoop-d12, hadoop-d15, hadoop-d16, hadoop-d18, hadoop-d19
hadoop-d22, hadoop-d23, hadoop-d24, hadoop-d27, hadoop-d28
hadoop-d33
hadoop-d41, hadoop-d45
hadoop-d50
 
For all the other nodes, each of which has been blacklisted by one failing 
job or another, no relevant entries are present in the logs.

For the remaining live machines, the following are excerpts from the logs.

HADOOP-D10 TASKTRACKER LOG:

2008-07-17 01:22:12,044 WARN org.apache.hadoop.mapred.TaskTracker: getMapOutput(task_200807090133_0053_m_000085_0,8) failed :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200807090133_0053/task_200807090133_0053_m_000085_0/output/file.out.index in any of the configured local directories
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:359)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
    at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2300)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
    at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
    at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
    at org.mortbay.http.HttpServer.service(HttpServer.java:954)
    at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
    at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
    at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
    at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
    at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
    at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
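
The "configured local directories" in that message should be whatever 
mapred.local.dir points to, so a quick sanity check on that node would be to 
make sure none of those directories is full or unwritable. A rough sketch 
(the path below is a placeholder for our real setting):

grep -A 1 mapred.local.dir conf/hadoop-site.xml   # which local dirs are configured
df -h /path/to/mapred/local                       # placeholder: repeat for each configured dir
ls -ld /path/to/mapred/local                      # check ownership and permissions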

Incidentally, we write log files into HDFS continuously from some of our 
web servers. We added a couple of new web servers recently (two days back, 
to be precise), which causes much more data to be dumped into HDFS. 
Strangely, the map reduce jobs started failing right after that (it might 
be a coincidence, or it might not).
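
A rough sketch of the checks we can run to see whether the extra data is 
simply filling up HDFS or the local disks (standard 0.17 commands, run from 
hadoop-d01 unless noted):

bin/hadoop dfsadmin -report   # per-datanode capacity and remaining space
bin/hadoop fsck /             # overall filesystem health
df -h                         # local disk usage (run on each node)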

Can anybody suggest why this is happening, or any remedy for it?

thanks and regards

Pratyush Banerjee