Posted to common-dev@hadoop.apache.org by Chris Schneider <Sc...@TransPac.com> on 2006/11/15 19:52:26 UTC

DFS/RPC problems

Fellow Hadoop coders,

I'm getting tragic DFS/RPC problems running Nutch 0.8+ atop Hadoop 
0.5.1+ that seem similar to problems others have been reporting 
(e.g., HADOOP-707). My great hope was that the HADOOP-563 patch would 
resolve them, and it may well have improved things somewhat. However, 
I'm still unable to complete the crawl I've been attempting even 
after rolling in this patch. A few notes about my setup that may be 
relevant:

1) I'm running a crawl through a proxy server, and I've modified 
Nutch so that it skips the server delay whenever it gets a hit in the 
proxy cache. Since I've been attempting the same crawl over and over 
again, the first part of it may run so fast that it taxes the DFS and 
RPC in a way that exposes fundamental problems.

2) I'm injecting 26M URLs into a brand new DB and then trying to 
fetch them all at once. I've tuned the fetcher.threads.fetch to 500 
in order to avoid fatal memory swapping on my cluster of 10 slaves (2 
tasktrackers per slave).

3) I've been using hadoop job -kill to stop the fetch job when I 
notice things not going the way I want (not necessarily catastrophic 
problems). I wonder whether this (in combination with the heavy 
loading I've applied to hadoop) may result in my DFS becoming 
"unusable".

4) Once my DFS becomes "unusable", I can stop-all.sh/start-all.sh,
reboot my systems, etc., but nothing will give me back a cluster I
can reliably use to do even relatively simple things like hadoop dfs
-du. The start-all.sh script takes a very long time to execute, but
hadoop eventually does come up, including its web server. The hadoop
fsck / command (unless it dies due to an RPC problem) also reports
that my DFS is healthy, though things are very sick indeed. Even
simple DFS commands either execute slowly or die with RPC problems
(SocketTimeoutException); see the probe sketch after this list for
the kind of minimal check that fails. Crawling at this point may
also get started OK, though the fetch will eventually die.

5) One of the modifications we've made to Nutch is a message queue
subsystem developed by Andrzej. Since this doesn't use NIO to do any
of its communication, I doubt that it is interfering with hadoop's
RPC. However, it does a fair amount of writing to the DFS itself, and
many of the error messages in the logs are associated with these
requests.

6) To resolve the "unusable" DFS problem, I'm forced to use the local 
FS to delete all of the DFS contents (on both master and slaves), and 
then I tell the namenode to format the DFS. I also wonder whether 
this operation may leave things in some kind of strange state, though 
I can't imagine what might be the problem.
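
When the cluster is in this state, even the most trivial namenode
round trip seems to hang. Here is the kind of minimal probe I have in
mind when I say "simple DFS commands" in #4 above (just a sketch; the
DfsProbe class name is made up, and it does nothing more than open a
DFS client and make one namenode call, roughly what hadoop dfs -du
has to do before it can print anything):

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   public class DfsProbe {
     public static void main(String[] args) throws Exception {
       Configuration conf = new Configuration();  // picks up hadoop-site.xml
       long start = System.currentTimeMillis();
       FileSystem fs = FileSystem.get(conf);      // sets up the client-to-namenode RPC proxy
       boolean ok = fs.exists(new Path("/"));     // a single, trivial namenode call
       long elapsed = System.currentTimeMillis() - start;
       System.out.println("exists(\"/\") = " + ok + " in " + elapsed + " ms");
     }
   }

On a healthy cluster something like this should come back in well
under a second; when the DFS is in its "unusable" state I'd expect it
to hang or die with the same SocketTimeoutException the command-line
tools report.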

After resorting to step #6 above yesterday, I performed another test
and conducted a fairly detailed analysis of the log files. I injected
my 26M URLs into a new crawldb, generated a fetch list (without
specifying any topN), and started fetching it. After over an hour of
fetching, I realized I wanted to do a little more logging on the
Nutch side, so I killed the fetch job, did a stop-all.sh, deployed a
modified version of Nutch, did a start-all.sh, and tried my crawl
again from scratch (i.e., I injected my 26M URLs into a brand new
crawldb, etc.). I let this run until the fetch job died, then mined
the logs for the following information.

Any help or insight you could provide would be greatly appreciated.

Best Regards,

- Chris

-------------------------------------------------------------------------------------------
I didn't find anything unusual in the stdout for start-all.sh.

The namenode may have had trouble removing files immediately after it was first
launched, but this is probably just housekeeping (making sure that it deletes
any files that are sometimes left over from previous runs):

   2006-11-14 11:13:23,276 WARN  dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /tmp/hadoop/mapred/.system.crc because it does not exist
   2006-11-14 11:13:23,280 WARN  dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /tmp/hadoop/mapred/system because it does not exist

Immediately after the injection for the first crawl began
   2006-11-14 11:45:45,141 INFO  crawl.Injector - Injector: starting
there is a significant(?) DFS block issue in the s1 datanode's log:

   2006-11-14 11:45:49,021 WARN  dfs.DataNode - DataXCeiver
   java.io.IOException: Block blk_452376043108156205 has already been started (though not completed), and thus cannot be created.
           at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:298)
           at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:663)
           at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:507)
           at java.lang.Thread.run(Thread.java:595)

This problem is reflected in the requesting (s7) datanode's log as
well, where the block transfer connection to s1 apparently gets reset(?):

   2006-11-14 11:45:49,010 WARN  dfs.DataNode - Failed to transfer blk_452376043108156205 to s1-crawlera/10.10.16.101:50010
   java.net.SocketException: Connection reset
           at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
           at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
           at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
           at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
           at java.io.DataOutputStream.write(DataOutputStream.java:90)
           at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:912)
           at java.lang.Thread.run(Thread.java:595)

(There are three of these events in s7's datanode log during the injection,
generation and fetching, which seem to be correlated with some of the message
queue subsystem problems mentioned below.)

As the injection for the first crawl completed
   2006-11-14 12:06:13,613 INFO  crawl.Injector - Injector: done
there are a couple of unimportant(?) messages that I can attribute to a brand
new crawldb:

   2006-11-14 12:06:13,603 WARN  dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /user/crawler/crawl-20061114114341/crawldb/.old.crc because it does not exist
   2006-11-14 12:06:13,603 WARN  dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /user/crawler/crawl-20061114114341/crawldb/old because it does not exist
   2006-11-14 12:06:13,604 WARN  dfs.StateChange - DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/crawler/crawl-20061114114341/crawldb/current to /user/crawler/crawl-20061114114341/crawldb/old because source does not exist
   2006-11-14 12:06:13,607 WARN  dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /user/crawler/crawl-20061114114341/crawldb/.old.crc because it does not exist
   2006-11-14 12:06:13,608 WARN  dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /user/crawler/crawl-20061114114341/crawldb/old because it does not exist

Once the fetcher job got going
   2006-11-14 12:21:30,654 INFO  fetcher.Fetcher - Fetcher: segment: /user/crawler/crawl-20061114114341/segments/20061114120615
Andrzej's message queue subsystem apparently had tons of trouble creating
various files, with messages getting spit out every minute or so:

   2006-11-14 12:21:54,827 WARN  dfs.StateChange - DIR* NameSystem.startFile: failed to create file /mq/alerts/fetcher/.1163535714813.s7-crawlera.crc for DFSClient_task_0005_m_000010_0 on client s7-crawlera because pendingCreates is non-null.
   ...
   2006-11-14 13:10:50,457 WARN  dfs.StateChange - DIR* NameSystem.startFile: failed to create file /mq/alerts/fetcher/.1163538650385.s7-crawlera.crc for DFSClient_task_0005_m_000019_0 on client s7-crawlera because pendingCreates is non-null.

These may be correlated with entries in the datanode logs:

   2006-11-14 13:10:46,387 WARN  dfs.DataNode - DataXCeiver
   java.io.IOException: Block blk_-8970705341808748626 is valid, and cannot be written to.
           at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:285)
           at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:663)
           at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:507)
           at java.lang.Thread.run(Thread.java:595)
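
For what it's worth, my reading of the "pendingCreates is non-null"
warnings above is that the namenode already has a create in flight
for that exact path when a second create for it arrives (e.g. from a
retried or speculative task attempt). A hypothetical illustration of
that pattern, not Andrzej's actual MQ code, with a made-up class name
and path:

   import java.io.IOException;
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FSDataOutputStream;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   public class PendingCreateRace {
     public static void main(String[] args) throws Exception {
       FileSystem fs = FileSystem.get(new Configuration());
       Path p = new Path("/mq/alerts/fetcher/example.msg");  // made-up path

       FSDataOutputStream first = fs.create(p);  // namenode now has a pending create for p
       try {
         // A second create of the same path before the first is closed is
         // what I suspect produces the NameSystem.startFile warning.
         FSDataOutputStream second = fs.create(p);
         second.close();
       } catch (IOException e) {
         System.err.println("second create rejected: " + e.getMessage());
       } finally {
         first.close();
       }
     }
   }

If that reading is right, then two fetcher tasks on the same host
naming their message files identically at the same millisecond would
be enough to trigger it.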

The namenode also noticed a few redundant addStoredBlock requests:

   2006-11-14 13:10:56,103 WARN  dfs.StateChange - BLOCK* NameSystem.addStoredBlock: Redundant addStoredBlock request received for blk_-4818305081009377296 on s10-crawlera:50010

But I *don't* find anything in the datanode logs mentioning these blocks.

Finally (perhaps in response to me killing this first fetch job?), we see our
first ClosedChannelException in the namenode log:

   2006-11-14 13:40:44,634 WARN  ipc.Server - handler output error
   java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:125)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:294)
     at org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer(SocketChannelOutputStream.java:120)
     at org.apache.hadoop.ipc.SocketChannelOutputStream.write(SocketChannelOutputStream.java:93)
     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
     at java.io.DataOutputStream.flush(DataOutputStream.java:106)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:486)

This is associated with "unknown opcode" messages in most of the datanode logs:

   s3-crawlera.prod.krugle.net: 2006-11-14 13:40:43,678 WARN dfs.DataNode - DataXCeiver
   s3-crawlera.prod.krugle.net: java.io.IOException: Unknown opcode for incoming data stream
   s3-crawlera.prod.krugle.net:    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:515)
   s3-crawlera.prod.krugle.net:    at java.lang.Thread.run(Thread.java:595)


--------------------------------------------------------------------------------
Here I did a stop-all.sh, deployed a version of nutch with a bit more logging,
and then did a start-all.sh
--------------------------------------------------------------------------------


The jobtracker log shows that it was apparently unable to connect to
the namenode just after it was launched the second time. The
connection error repeats over and over again, but eventually we get
a few messages about the web server starting up, which may indicate
that the jobtracker was finally able to connect to the namenode(?):

   2006-11-14 13:59:52,321 WARN  mapred.JobTracker - Starting tracker
   java.net.ConnectException: Connection refused
     at java.net.PlainSocketImpl.socketConnect(Native Method)
     at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
     at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
     at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
     at java.net.Socket.connect(Socket.java:507)
     at java.net.Socket.connect(Socket.java:457)
     at java.net.Socket.<init>(Socket.java:365)
     at java.net.Socket.<init>(Socket.java:207)
     at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:113)
     at org.apache.hadoop.ipc.Client.getConnection(Client.java:359)
     at org.apache.hadoop.ipc.Client.call(Client.java:297)
     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
     at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source)
     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:245)
     at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:103)
      at org.apache.hadoop.dfs.DistributedFileSystem.<init>(DistributedFileSystem.java:47)
     at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:101)
     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:86)
     at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:486)
     at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:68)
     at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:1200)
   ...
   2006-11-14 14:00:18,352 WARN  mapred.JobTracker - Starting tracker
   java.net.ConnectException: Connection refused
     at java.net.PlainSocketImpl.socketConnect(Native Method)
     at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
     at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
     at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
     at java.net.Socket.connect(Socket.java:507)
     at java.net.Socket.connect(Socket.java:457)
     at java.net.Socket.<init>(Socket.java:365)
     at java.net.Socket.<init>(Socket.java:207)
     at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:113)
     at org.apache.hadoop.ipc.Client.getConnection(Client.java:359)
     at org.apache.hadoop.ipc.Client.call(Client.java:297)
     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
     at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source)
     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:245)
     at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:103)
      at org.apache.hadoop.dfs.DistributedFileSystem.<init>(DistributedFileSystem.java:47)
     at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:101)
     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:86)
     at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:486)
     at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:68)
     at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:1200)
   2006-11-14 14:00:19,516 INFO  util.Credential - Checking Resource aliases
   2006-11-14 14:00:19,534 INFO  http.HttpServer - Version Jetty/5.1.4
   2006-11-14 14:00:20,423 INFO  util.Container - Started org.mortbay.jetty.servlet.WebApplicationHandler@69267649
   2006-11-14 14:00:20,492 INFO  util.Container - Started WebApplicationContext[/,/]
   2006-11-14 14:00:20,493 INFO  util.Container - Started HttpContext[/logs,/logs]
   2006-11-14 14:00:20,493 INFO  util.Container - Started HttpContext[/static,/static]
   2006-11-14 14:00:20,496 INFO  http.SocketListener - Started SocketListener on 0.0.0.0:50030
   2006-11-14 14:00:20,496 INFO  util.Container - Started org.mortbay.jetty.Server@b6e39f

Once the second crawl gets started
   2006-11-14 14:02:03,392 INFO  crawl.Injector - Injector: starting
the same issues with the new crawldb, the message queue, and redundant blocks
mentioned above appear again in the namenode log (not shown).

The DFS/RPC problems finally begin to surface in the jobtracker log (and these
bubble up to tool output) several hours into the second crawl:

   2006-11-14 17:07:27,539 WARN  fs.DFSClient - Problem renewing lease for DFSClient_-284911018: java.net.SocketTimeoutException: timed out waiting for rpc response
     at org.apache.hadoop.ipc.Client.call(Client.java:312)
     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
     at org.apache.hadoop.dfs.$Proxy0.renewLease(Unknown Source)
     at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:440)
     at java.lang.Thread.run(Thread.java:595)

Nothing special appears during this same timeframe in the namenode log (just
the continuing message queue and redundant blocks errors).

Finally, there is an indication in the jobtracker log (embedded in the
continuing DFS/RPC problems) that the fetch job has died and is being garbage
collected:

   2006-11-14 18:46:37,681 WARN  fs.DFSClient - Problem renewing lease for DFSClient_-284911018: java.net.SocketTimeoutException: timed out waiting for rpc response
     at org.apache.hadoop.ipc.Client.call(Client.java:312)
     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
     at org.apache.hadoop.dfs.$Proxy0.renewLease(Unknown Source)
     at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:440)
     at java.lang.Thread.run(Thread.java:595)

   2006-11-14 18:47:37,339 WARN  mapred.JobInProgress - Error cleaning up job_0005: java.net.SocketTimeoutException: timed out waiting for rpc response
   2006-11-14 18:47:38,696 WARN  fs.DFSClient - Problem renewing lease for DFSClient_-284911018: java.net.SocketTimeoutException: timed out waiting for rpc response
     at org.apache.hadoop.ipc.Client.call(Client.java:312)
     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
     at org.apache.hadoop.dfs.$Proxy0.renewLease(Unknown Source)
     at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:440)
     at java.lang.Thread.run(Thread.java:595)

About 20 minutes later, we finally see the first of a series of broken pipe
exceptions in the namenode log:

   2006-11-14 19:09:54,440 WARN  ipc.Server - handler output error
   java.io.IOException: Broken pipe
     at sun.nio.ch.FileDispatcher.write0(Native Method)
     at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
     at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
     at sun.nio.ch.IOUtil.write(IOUtil.java:60)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:302)
     at org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer(SocketChannelOutputStream.java:120)
     at org.apache.hadoop.ipc.SocketChannelOutputStream.write(SocketChannelOutputStream.java:93)
     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
     at java.io.DataOutputStream.flush(DataOutputStream.java:106)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:486)

About 30 minutes later, the first of a series of closed channel exceptions
appears in the namenode log (along with an indication that the namenode may
have lost connection to all but one of its datanodes):

   2006-11-14 19:39:43,142 WARN  ipc.Server - handler output error
   java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:125)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:294)
     at org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer(SocketChannelOutputStream.java:120)
     at org.apache.hadoop.ipc.SocketChannelOutputStream.write(SocketChannelOutputStream.java:93)
     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
     at java.io.DataOutputStream.flush(DataOutputStream.java:106)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:486)
   2006-11-14 19:39:43,142 WARN  fs.FSNamesystem - Replication requested of 3 is larger than cluster size (1). Using cluster size.

The namenode process eventually dies, though the jobtracker survives (as do
all of the datanodes and tasktrackers on the slaves).
-- 
------------------------
Chris Schneider
TransPac Software, Inc.
Schmed@TransPac.com
------------------------

Re: DFS/RPC problems

Posted by Chris Schneider <Sc...@TransPac.com>.
Eric,

>How many files are you creating in HDFS?
>
>I wonder if the messaging package may simply be swamping the namenode with many, many small files?  It really is not designed to deal with that use case.  How big are your checkpoint files?

Thanks for your excellent suggestion! It appears that this was indeed causing the problem. I will work with Andrzej to come up with a solution to limit the number of files created by his message queue subsystem.
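
The direction I have in mind is roughly this (a rough sketch only,
not Andrzej's actual code; the class name, paths, and threshold below
are made up): buffer messages in memory and flush them in batches to
a small number of larger DFS files, instead of creating one DFS file
per message.

   import java.io.IOException;
   import java.util.ArrayList;
   import java.util.List;
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FSDataOutputStream;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   public class BatchedAlertWriter {
     private final FileSystem fs;
     private final Path dir;                 // e.g. /mq/alerts/fetcher
     private final String prefix;            // e.g. the task attempt id
     private final int flushThreshold;       // messages per DFS file
     private final List<String> buffer = new ArrayList<String>();

     public BatchedAlertWriter(Configuration conf, Path dir, String prefix,
                               int flushThreshold) throws IOException {
       this.fs = FileSystem.get(conf);
       this.dir = dir;
       this.prefix = prefix;
       this.flushThreshold = flushThreshold;
     }

     public synchronized void post(String message) throws IOException {
       buffer.add(message);
       if (buffer.size() >= flushThreshold) {
         flush();
       }
     }

     public synchronized void flush() throws IOException {
       if (buffer.isEmpty()) return;
       // One DFS file per batch rather than per message, so the namenode
       // sees far fewer create requests.
       Path batch = new Path(dir, prefix + "." + System.currentTimeMillis());
       FSDataOutputStream out = fs.create(batch);
       try {
         for (String msg : buffer) {
           out.writeBytes(msg + "\n");
         }
       } finally {
         out.close();
       }
       buffer.clear();
     }
   }

That should keep the one-file-per-message pressure off the namenode
while still getting the alerts into DFS reasonably promptly.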

Best Regards,

- Chris


-- 
------------------------
Chris Schneider
TransPac Software, Inc.
Schmed@TransPac.com
------------------------

Re: DFS/RPC problems

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.
How many files are you creating in HDFS?

I wonder if the messaging package may simply be swamping the namenode  
with many, many small files?  It really is not designed to deal with  
that use case.  How big are your checkpoint files?
