Posted to common-user@hadoop.apache.org by Michael Stack <st...@archive.org> on 2006/03/08 21:22:12 UTC
More about 'hang' on invertlinks.
I have more information on the hang described below:
Looking at the list of reduce tasks in jobdetails, all seem to be hung
waiting on one host. Here is the status message that all 31 of my
reduce tasks exhibit:
reduce > copy > task_m_5g00f1@ia109309.archive.org:50040
Looking over on the 309 machine, I see that as far as its tasktracker is
concerned, task_m_5g00f1 completed fine:
060308 112952 parsing file:/0/hadoop/nara/app/runtime-conf/hadoop-site.xml
060308 112952 task_m_5g00f1 1.0% /user/stack/nara/outputs/segments/200603072109520:224000000+32000000
060308 112953 Task task_m_5g00f1 is done.
...
...and then it goes to do cleanup:
060308 122603 task_m_5g00f1 done; removing files.
Twenty-odd seconds later I start seeing the likes of the below in the
log, and they are unending:
060308 122626 Server handler 0 on 50040 caught: java.io.FileNotFoundException: /0/hadoop/tmp/task_m_5g00f1/part-19.out
java.io.FileNotFoundException: /0/hadoop/tmp/task_m_5g00f1/part-19.out
    at org.apache.hadoop.fs.LocalFileSystem.openRaw(LocalFileSystem.java:113)
    at org.apache.hadoop.fs.FSDataInputStream$Checker.<init>(FSDataInputStream.j
    at org.apache.hadoop.fs.FSDataInputStream.<init>(FSDataInputStream.java:228)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:154)
    at org.apache.hadoop.mapred.MapOutputFile.write(MapOutputFile.java:106)
    at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:117)
    at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:215)
It looks like the IPC server handlers are stuck trying to read map-output
parts that have since been removed. Things look a little confused.
Will keep digging... but any suggestions as to what might be going on,
or things to try, would be appreciated.
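That theory is easy to demonstrate in miniature. Below is a hypothetical sketch in plain java.io (not Hadoop code; the file name is made up) of the suspected sequence: task cleanup deletes a part file while a handler still holds a request for it, and the subsequent open fails exactly like the log above:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

// Not Hadoop code: a tiny stand-in for the suspected sequence above.
// Cleanup removes a map-output part file; a handler that still holds a
// request for it then fails with FileNotFoundException on open.
public class MapOutputRaceDemo {
    // Returns true if the part file could be opened, false on FileNotFoundException.
    static boolean tryOpen(File part) throws IOException {
        try (FileInputStream in = new FileInputStream(part)) {
            return true;
        } catch (FileNotFoundException e) {
            return false; // the failure mode in the Server handler log above
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for /0/hadoop/tmp/task_m_5g00f1/part-19.out
        File part = File.createTempFile("part-19", ".out");
        System.out.println("before cleanup: " + tryOpen(part));
        part.delete(); // "task_m_5g00f1 done; removing files."
        System.out.println("after cleanup: " + tryOpen(part));
    }
}
```

If that is what is happening, the real question is why the reducers are still asking for an output the tasktracker believes it has finished serving.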
St.Ack
Michael Stack wrote:
> ...
>
> 3. The rack looks to be currently 'hung' on invertlinks. All reduce tasks
> show the exact same progress, about 0.25 complete. No output from the
> jobtracker in the last 4 hours. The namenode log has block reports. All CPUs
> are quiescent, even the jobtracker's. 5 reduce tasks had the below
> exception:
>
> 060308 132336 task_r_9swl2k Client connection to 207.241.228.28:8009: starting
> 060308 132336 task_r_9swl2k parsing file:/0/hadoop/nara/app/runtime-conf/hadoop-default.xml
> 060308 132336 task_r_9swl2k parsing file:/0/hadoop/nara/app/runtime-conf/hadoop-site.xml
> 060308 132336 Server connection on port 50050 from 207.241.227.176: starting
> 060308 132336 task_r_9swl2k 0.75% reduce > reduce
> 060308 132336 task_r_9swl2k Client connection to 0.0.0.0:50050: starting
> 060308 132339 task_r_9swl2k Error running child
> 060308 132339 task_r_9swl2k java.lang.RuntimeException: java.io.EOFException
> 060308 132339 task_r_9swl2k     at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:132)
> 060308 132339 task_r_9swl2k     at org.apache.nutch.crawl.LinkDb.reduce(LinkDb.java:108)
> 060308 132339 task_r_9swl2k     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:283)
> 060308 132339 task_r_9swl2k     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:666)
> 060308 132339 task_r_9swl2k Caused by: java.io.EOFException
> 060308 132339 task_r_9swl2k     at java.io.DataInputStream.readFully(DataInputStream.java:178)
> 060308 132339 task_r_9swl2k     at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
> 060308 132339 task_r_9swl2k     at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
> 060308 132339 task_r_9swl2k     at org.apache.hadoop.io.UTF8.readChars(UTF8.java:212)
> 060308 132339 task_r_9swl2k     at org.apache.hadoop.io.UTF8.readString(UTF8.java:204)
> 060308 132339 task_r_9swl2k     at org.apache.nutch.crawl.Inlink.readFields(Inlink.java:36)
> 060308 132339 task_r_9swl2k     at org.apache.nutch.crawl.Inlink.read(Inlink.java:53)
> 060308 132339 task_r_9swl2k     at org.apache.nutch.crawl.Inlinks.readFields(Inlinks.java:44)
> 060308 132339 task_r_9swl2k     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:347)
> 060308 132339 task_r_9swl2k     at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.getNext(ReduceTask.java:163)
> 060308 132339 task_r_9swl2k     at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:129)
> 060308 132339 task_r_9swl2k ... 3 more
> 060308 132340 Server connection on port 50050 from 207.241.227.176: exiting
> 060308 132340 KILLING CHILD PROCESS task_r_9swl2k
>
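For what it is worth, that EOFException is what DataInputStream.readFully throws when a stream ends before delivering the number of bytes the reader expected, e.g. a truncated or partially-copied part file. A minimal illustration with plain java.io (no Hadoop or Nutch types; record sizes are made up):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// readFully() promises the caller exactly n bytes; if the stream ends
// early (truncated part file, connection cut mid-record) it throws
// EOFException, which is what surfaces wrapped in the RuntimeException above.
public class TruncatedRecordDemo {
    // Returns true if a full record of expectedLen bytes could be read.
    static boolean readRecord(byte[] data, int expectedLen) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        byte[] buf = new byte[expectedLen];
        try {
            in.readFully(buf);
            return true;
        } catch (EOFException e) {
            return false; // stream ended before the record was complete
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] full = {1, 2, 3, 4};
        byte[] truncated = {1, 2};
        System.out.println("full record reads:      " + readRecord(full, 4));
        System.out.println("truncated record reads: " + readRecord(truncated, 4));
    }
}
```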
> A thread dump from job tracker shows a bunch of threads in this state:
>
> Full thread dump Java HotSpot(TM) Server VM (1.5.0_06-b05 mixed mode):
>
> "Server connection on port 8010 from xxx.xxx.xxx.xxx" daemon prio=1 tid=0xad324720 nid=0x2074 runnable [0xac07e000..0xac07ee40]
>     at java.net.SocketInputStream.socketRead0(Native Method)
>     at java.net.SocketInputStream.read(SocketInputStream.java:129)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
>     - locked <0xb4570d08> (a java.io.BufferedInputStream)
>     at java.io.DataInputStream.readInt(DataInputStream.java:353)
>     at org.apache.hadoop.ipc.Server$Connection.run(Server.java:129)
>
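Those handler threads are parked in a plain blocking readInt; a socket read with no read timeout waits indefinitely once the peer goes silent, which would match threads sitting in socketRead0 for hours. A self-contained sketch with plain java.net (not Hadoop's IPC code; the timeout value is arbitrary):

```java
import java.io.DataInputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Sketch of why the handler threads can sit in readInt() forever: a
// blocking socket read with no SO_TIMEOUT never returns if the peer
// stops talking. With a timeout set, the same read fails fast instead.
public class BlockingReadDemo {
    // Returns true if an int was read, false if the read timed out.
    static boolean tryReadInt(int timeoutMillis) throws Exception {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort());
             Socket peer = server.accept()) {
            client.setSoTimeout(timeoutMillis);
            DataInputStream in = new DataInputStream(client.getInputStream());
            try {
                in.readInt(); // the peer never writes anything
                return true;
            } catch (SocketTimeoutException e) {
                return false; // without setSoTimeout, this thread would block here forever
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("timed out instead of hanging: " + !tryReadInt(200));
    }
}
```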
> If I connect to the mentioned tasktrackers (those at the IPs
> xxx.xxx.xxx.xxx in the jobtracker thread dump), no children are
> running....
>
> Any pointers appreciated. Meantime will keep digging.
>
> St.Ack