Posted to common-user@hadoop.apache.org by Ian Soboroff <ia...@nist.gov> on 2009/02/03 04:09:22 UTC

My tasktrackers keep getting lost...

I hope someone can help me out.  I'm getting started with Hadoop, 
have written the first part of my project (a custom InputFormat), and am
now using that to test out my cluster setup.
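
For what it's worth, the InputFormat itself is nothing exotic; it looks
roughly like the sketch below, written against the 0.19 "old" mapred API.
This is only an illustration, not my actual code, and the class name and
record types here are made up.

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical sketch of a bundle-file InputFormat, not the real one.
public class BundleInputFormat extends FileInputFormat<LongWritable, Text> {

  // Each bundle file is handed whole to a single map task.
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    reporter.setStatus(split.toString());
    // Placeholder reader: the real one would emit one record per web page;
    // LineRecordReader just stands in so the sketch is self-contained.
    return new LineRecordReader(job, (FileSplit) split);
  }
}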

I'm running 0.19.0.  I have five dual-core Linux workstations with most
of a 250GB disk available for playing, and am controlling things from my
Mac Pro.  (This is not the production cluster, that hasn't been
assembled yet.  This is just to get the code working and figure out the
bumps.)

My test data is about 18GB of web pages, and the test app at the moment
just counts the number of web pages in each bundle file.  The map tasks
run just fine, but when the job gets into the reduce phase, the TaskTrackers all
get "lost" to the JobTracker.  I can't see why, because the TaskTrackers
are all still running on the slaves.  Also, the jobdetails URL starts
returning an HTTP 500 error, although other links from that page still
work.

I've tried going onto the slaves and manually restarting the
tasktrackers with hadoop-daemon.sh, and also turning on job restarting
in the site conf and then running stop-mapred/start-mapred.  The
trackers start up and try to clean up and get going again, but they then
just get lost again.
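
By "turning on job restarting" I mean setting the JobTracker recovery flag
in hadoop-site.xml.  If I've read hadoop-default.xml right, the property is
mapred.jobtracker.restart.recover, so the override looks something like this:

<property>
  <name>mapred.jobtracker.restart.recover</name>
  <value>true</value>
  <description>Recover running jobs when the JobTracker restarts.</description>
</property>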

Here's some error output from the master jobtracker:

2009-02-02 13:39:40,904 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200902021252_0002_r_000005_1' from 'tracker_darling:localhost.localdomain/127.0.0.1:58336'
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: attempt_200902021252_0002_m_004592_1 is 796370 ms debug.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching task attempt_200902021252_0002_m_004592_1 timed out.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: attempt_200902021252_0002_m_004582_1 is 794199 ms debug.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching task attempt_200902021252_0002_m_004582_1 timed out.
2009-02-02 13:41:22,271 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_cheyenne:localhost.localdomain/127.0.0.1:52769'; resending the previous 'lost' response
2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_tigris:localhost.localdomain/127.0.0.1:52808'; resending the previous 'lost' response
2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_monocacy:localhost.localdomain/127.0.0.1:54464'; Resending the previous 'lost' response
2009-02-02 13:41:22,298 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_129.6.101.41:127.0.0.1/127.0.0.1:58744'; resending the previous 'lost' response
2009-02-02 13:41:22,421 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_rhone:localhost.localdomain/127.0.0.1:45749'; resending the previous 'lost' response
2009-02-02 13:41:22,421 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54311 caught: java.lang.NullPointerException
        at org.apache.hadoop.mapred.MapTask.write(MapTask.java:123)
        at org.apache.hadoop.mapred.LaunchTaskAction.write(LaunchTaskAction.java:48)
        at org.apache.hadoop.mapred.HeartbeatResponse.write(HeartbeatResponse.java:101)
        at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:159)
        at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:70)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:907)

2009-02-02 13:41:27,275 WARN org.apache.hadoop.mapred.JobTracker: Status from unknown Tracker : tracker_monocacy:localhost.localdomain/127.0.0.1:54464

And from a slave:

2009-02-02 13:26:39,440 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 129.6.101.18:50060, dest: 129.6.101.12:37304, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_200902021252_0002_m_000111_0
2009-02-02 13:41:40,165 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call to rogue/129.6.101.41:54311 failed on local exception: null
        at org.apache.hadoop.ipc.Client.call(Client.java:699)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at org.apache.hadoop.mapred.$Proxy4.heartbeat(Unknown Source)
        at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1164)
        at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:997)
        at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1678)
        at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2698)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
        at org.apache.hadoop.io.UTF8.readChars(UTF8.java:211)
        at org.apache.hadoop.io.UTF8.readString(UTF8.java:203)
        at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:230)
        at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:171)
        at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:219)
        at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:66)
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:506)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:438)

2009-02-02 13:41:40,165 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'rogue' with reponseId '835
2009-02-02 13:41:40,177 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200902021252_0002_r_000007_0: Task attempt_200902021252_0002_r_000007_0 failed to report status for 1600 seconds. Killing!
2009-02-02 13:41:40,287 INFO org.apache.hadoop.mapred.TaskTracker: Sent out 6 bytes for reduce: 4 from map: attempt_200902021252_0002_m_000118_0 given 6/2 from 24 with (-1, -1)
2009-02-02 13:41:40,333 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 129.6.101.18:50060, dest: 129.6.101.12:39569, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_200902021252_0002_m_000118_0
2009-02-02 13:41:40,337 INFO org.apache.hadoop.mapred.TaskTracker: Process Thread Dump: lost task

... and lots of stack traces follow.  I'm not sure which error output is most
helpful for this report.

Anyway, I can see that the logs tell me that something went wrong on the
slave, and as a result the master has lost contact with the slave
tasktracker, which is still alive, just not running my job anymore.  Can
anyone tell me where to start looking?

Thanks,
Ian


Re: My tasktrackers keep getting lost...

Posted by Ian Soboroff <ia...@nist.gov>.
So staring at these logs a bit more and reading hadoop-default.xml and  
thinking a bit, it seems to me that for some reason my slave  
tasktrackers are having trouble sending heartbeats back to the  
master.  I'm not sure why this is.  It is happening during the shuffle  
phase of the reduce setup.

I could bump the heartbeat timeout, but 600000 ms seems more than plenty.
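
For concreteness, the timeout I mean is mapred.tasktracker.expiry.interval
in hadoop-default.xml, which is 600000 ms out of the box.  If I did want to
bump it, the hadoop-site.xml override would look roughly like this:

<property>
  <name>mapred.tasktracker.expiry.interval</name>
  <value>1200000</value>
  <description>Milliseconds the JobTracker waits without a heartbeat
  before declaring a tasktracker lost.</description>
</property>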

Ian


Re: My tasktrackers keep getting lost...

Posted by Ian Soboroff <ia...@nist.gov>.
On Feb 2, 2009, at 11:38 PM, Sagar Naik wrote:

> Can u post the output from
> hadoop-argus-<hostname>-jobtracker.out

Sure:

Exception closing file /user/soboroff/output/_logs/history/rogue_1233597148110_job_200902021252_0002_soboroff_index
java.io.IOException: Filesystem closed
	at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198)
	at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3084)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3053)
	at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.close(DFSClient.java:942)
	at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:210)
	at org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:243)
	at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:1413)
	at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:236)
	at org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:221)

(Sorry if the job IDs don't exactly match between my cut-and-pasted  
logs.  I tried running this job several times, with the same result  
each time.)

Ian



Re: My tasktrackers keep getting lost...

Posted by Sagar Naik <sn...@attributor.com>.
Can u post the output from
hadoop-argus-<hostname>-jobtracker.out

-Sagar

jason hadoop wrote:
> When I was at Attributor we experienced periodic odd XFS hangs that would
> freeze up the Hadoop server processes, resulting in them going away.
> Sometimes XFS would deadlock all writes to the log file and the server would
> freeze trying to log a message.  We couldn't even jstack the JVM.
> We never had any traction on resolving the XFS deadlocks and simply rebooted
> the machines when the problem occurred.
>
> On Mon, Feb 2, 2009 at 7:09 PM, Ian Soboroff <ia...@nist.gov> wrote:
>
>   
>> [...]

Re: My tasktrackers keep getting lost...

Posted by jason hadoop <ja...@gmail.com>.
When I was at Attributor we experienced periodic odd XFS hangs that would
freeze up the Hadoop server processes, resulting in them going away.
Sometimes XFS would deadlock all writes to the log file and the server would
freeze trying to log a message.  We couldn't even jstack the JVM.
We never had any traction on resolving the XFS deadlocks and simply rebooted
the machines when the problem occurred.

On Mon, Feb 2, 2009 at 7:09 PM, Ian Soboroff <ia...@nist.gov> wrote:

>
> [...]