Posted to common-dev@hadoop.apache.org by Chris Schneider <Sc...@TransPac.com> on 2006/11/15 19:52:26 UTC
DFS/RPC problems
Fellow Hadoop coders,
I'm getting tragic DFS/RPC problems running Nutch 0.8+ atop Hadoop
0.5.1+ that seem similar to problems others have been reporting
(e.g., HADOOP-707). My great hope was that the HADOOP-563 patch would
resolve them, and it may well have improved things somewhat. However,
I'm still unable to complete the crawl I've been attempting even
after rolling in this patch. A few notes about my setup that may be
relevant:
1) I'm running a crawl through a proxy server, and I've modified
Nutch so that it skips the server delay whenever it gets a hit in the
proxy cache. Since I've been attempting the same crawl over and over
again, the first part of it may run so fast that it taxes the DFS and
RPC in a way that exposes fundamental problems.
2) I'm injecting 26M URLs into a brand new DB and then trying to
fetch them all at once. I've tuned fetcher.threads.fetch to 500
in order to avoid fatal memory swapping on my cluster of 10 slaves (2
tasktrackers per slave).
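(For context, that setting lives in our nutch-site.xml override. The fragment below is roughly what we have; the comment is mine, and the exact wording of the surrounding file is from memory:)

```xml
<!-- nutch-site.xml (overrides nutch-default.xml) -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>500</value>
  <!-- Fetcher threads per fetch task; kept at 500 so the two
       tasktrackers per slave don't drive the boxes into swap. -->
</property>
```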
3) I've been using hadoop job -kill to stop the fetch job when I
notice things not going the way I want (not necessarily catastrophic
problems). I wonder whether this (in combination with the heavy
loading I've applied to hadoop) may result in my DFS becoming
"unusable".
4) Once my DFS becomes "unusable", I can stop-all.sh/start-all.sh,
reboot my systems, etc., but nothing will give me back a cluster I
can reliably use to do even relatively simple things like hadoop dfs
-du. The start-all.sh script takes a very long time to execute, but
hadoop eventually does come up, including its web server. The hadoop
fsck / command (unless it dies due to an RPC problem) also reports
that my DFS is healthy, though things are very sick indeed. Even
simple DFS commands either execute slowly or die with RPC problems
(SocketTimeoutException). Crawling at this point may also get started
OK, though the fetch will eventually die.
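(In case anyone wants to reproduce the symptoms: by "simple DFS commands" I mean things along these lines, run from $HADOOP_HOME on the master. On a healthy cluster both return within seconds; on mine they hang or die with SocketTimeoutException:)

```shell
time bin/hadoop dfs -du /   # trivial namenode RPC; hangs or times out
bin/hadoop fsck /           # claims the DFS is healthy even when it's sick
```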
5) One of the modifications we've made to Nutch is a message queue
subsystem developed by Andrzej. Since this doesn't use NIO to do any
of its communication, I doubt that it is interfering with hadoop's
RPC. However, it does a fair amount of writing to the DFS itself, and
many of the error messages in the logs are associated with these
requests.
6) To resolve the "unusable" DFS problem, I'm forced to use the local
FS to delete all of the DFS contents (on both master and slaves), and
then I tell the namenode to format the DFS. I also wonder whether
this operation may leave things in some kind of strange state, though
I can't imagine what might be the problem.
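(Concretely, the reset in #6 is roughly the following, run from $HADOOP_HOME on the master. The /tmp/hadoop/dfs paths are just where my dfs.name.dir and dfs.data.dir happen to point; adjust to taste:)

```shell
bin/stop-all.sh

# Wipe the namenode's image/edits and each datanode's block storage
# via the *local* filesystem. This destroys all DFS data.
rm -rf /tmp/hadoop/dfs
for slave in `cat conf/slaves`; do
  ssh $slave rm -rf /tmp/hadoop/dfs
done

# Re-create an empty filesystem image, then bring the cluster back up.
bin/hadoop namenode -format
bin/start-all.sh
```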
After resorting to step #6 above yesterday, I performed another test
and conducted a fairly detailed analysis of the log files. I injected
my 26M URLs into a new crawldb, generated a fetch list (without
specifying any topN) and started fetching it. After over an hour of
fetching, I noticed that I wanted to do a little more logging on the
Nutch side, so I killed the fetch job, did a stop-all.sh, deployed a
modified version of Nutch, did a start-all.sh, and tried my crawl
again from scratch (i.e., I injected my 26M URLs into a brand new
crawldb, etc.). I let this run until the fetch job died, then mined
the logs for the following information.
Any help or insight you could provide would be greatly appreciated.
Best Regards,
- Chris
-------------------------------------------------------------------------------------------
I didn't find anything unusual in the stdout for start-all.sh.
The namenode may have had trouble removing files immediately after it was first
launched, but this is probably just housekeeping (making sure that it deletes
any files that are sometimes left over from previous runs):
2006-11-14 11:13:23,276 WARN dfs.StateChange - DIR*
FSDirectory.unprotectedDelete: failed to remove
/tmp/hadoop/mapred/.system.crc because it does not exist
2006-11-14 11:13:23,280 WARN dfs.StateChange - DIR*
FSDirectory.unprotectedDelete: failed to remove
/tmp/hadoop/mapred/system because it does not exist
Immediately after the injection for the first crawl began
2006-11-14 11:45:45,141 INFO crawl.Injector - Injector: starting
there is a significant(?) DFS block issue in the s1 datanode's log:
2006-11-14 11:45:49,021 WARN dfs.DataNode - DataXCeiver
java.io.IOException: Block blk_452376043108156205 has already been
started (though not completed), and thus cannot be created.
at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:298)
at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:663)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:507)
at java.lang.Thread.run(Thread.java:595)
This problem is reflected in the requesting (s7) datanode's log as well and
apparently results in its connection to the namenode getting reset(?):
2006-11-14 11:45:49,010 WARN dfs.DataNode - Failed to transfer
blk_452376043108156205 to s1-crawlera/10.10.16.101:50010
java.net.SocketException: Connection reset
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:912)
at java.lang.Thread.run(Thread.java:595)
(There are three of these events in s7's datanode log during the injection,
generation and fetching, which seem to be correlated with some of the message
queue subsystem problems mentioned below.)
As the injection for the first crawl completed
2006-11-14 12:06:13,613 INFO crawl.Injector - Injector: done
there are a few unimportant(?) messages that I can attribute to a brand
new crawldb:
2006-11-14 12:06:13,603 WARN dfs.StateChange - DIR*
FSDirectory.unprotectedDelete: failed to remove
/user/crawler/crawl-20061114114341/crawldb/.old.crc because it does
not exist
2006-11-14 12:06:13,603 WARN dfs.StateChange - DIR*
FSDirectory.unprotectedDelete: failed to remove
/user/crawler/crawl-20061114114341/crawldb/old because it does not
exist
2006-11-14 12:06:13,604 WARN dfs.StateChange - DIR*
FSDirectory.unprotectedRenameTo: failed to rename
/user/crawler/crawl-20061114114341/crawldb/current to
/user/crawler/crawl-20061114114341/crawldb/old because source does
not exist
2006-11-14 12:06:13,607 WARN dfs.StateChange - DIR*
FSDirectory.unprotectedDelete: failed to remove
/user/crawler/crawl-20061114114341/crawldb/.old.crc because it does
not exist
2006-11-14 12:06:13,608 WARN dfs.StateChange - DIR*
FSDirectory.unprotectedDelete: failed to remove
/user/crawler/crawl-20061114114341/crawldb/old because it does not
exist
Once the fetcher job got going
2006-11-14 12:21:30,654 INFO fetcher.Fetcher - Fetcher: segment:
/user/crawler/crawl-20061114114341/segments/20061114120615
Andrzej's message queue subsystem apparently had tons of trouble creating
various files, with messages getting spit out every minute or so:
2006-11-14 12:21:54,827 WARN dfs.StateChange - DIR*
NameSystem.startFile: failed to create file
/mq/alerts/fetcher/.1163535714813.s7-crawlera.crc for
DFSClient_task_0005_m_000010_0 on client s7-crawlera because
pendingCreates is non-null.
...
2006-11-14 13:10:50,457 WARN dfs.StateChange - DIR*
NameSystem.startFile: failed to create file
/mq/alerts/fetcher/.1163538650385.s7-crawlera.crc for
DFSClient_task_0005_m_000019_0 on client s7-crawlera because
pendingCreates is non-null.
These may be correlated with entries in the datanode logs:
2006-11-14 13:10:46,387 WARN dfs.DataNode - DataXCeiver
java.io.IOException: Block blk_-8970705341808748626 is valid, and
cannot be written to.
at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:285)
at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:663)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:507)
at java.lang.Thread.run(Thread.java:595)
The namenode also noticed a few redundant addStoredBlock requests:
2006-11-14 13:10:56,103 WARN dfs.StateChange - BLOCK*
NameSystem.addStoredBlock: Redundant addStoredBlock request received
for blk_-4818305081009377296 on s10-crawlera:50010
But I *don't* find anything in the datanode logs mentioning these blocks.
Finally (perhaps in response to me killing this first fetch job?), we see our
first ClosedChannelException in the namenode log:
2006-11-14 13:40:44,634 WARN ipc.Server - handler output error
java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:125)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:294)
at org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer(SocketChannelOutputStream.java:120)
at org.apache.hadoop.ipc.SocketChannelOutputStream.write(SocketChannelOutputStream.java:93)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
at java.io.DataOutputStream.flush(DataOutputStream.java:106)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:486)
This is associated with "unknown opcode" messages in most of the datanode logs:
s3-crawlera.prod.krugle.net: 2006-11-14 13:40:43,678 WARN
dfs.DataNode - DataXCeiver
s3-crawlera.prod.krugle.net: java.io.IOException: Unknown opcode
for incoming data stream
s3-crawlera.prod.krugle.net: at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:515)
s3-crawlera.prod.krugle.net: at java.lang.Thread.run(Thread.java:595)
--------------------------------------------------------------------------------
Here I did a stop-all.sh, deployed a version of nutch with a bit more logging,
and then did a start-all.sh
--------------------------------------------------------------------------------
The jobtracker log shows that it was apparently unable to connect to the
namenode just after being launched the second time. The connection error
repeats over and over, but eventually a few messages about the web server
starting up may indicate that the jobtracker was finally able to connect
to the namenode(?):
2006-11-14 13:59:52,321 WARN mapred.JobTracker - Starting tracker
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:507)
at java.net.Socket.connect(Socket.java:457)
at java.net.Socket.<init>(Socket.java:365)
at java.net.Socket.<init>(Socket.java:207)
at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:113)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:359)
at org.apache.hadoop.ipc.Client.call(Client.java:297)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:245)
at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:103)
at org.apache.hadoop.dfs.DistributedFileSystem.<init>(DistributedFileSystem.java:47)
at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:101)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:86)
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:486)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:68)
at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:1200)
...
2006-11-14 14:00:18,352 WARN mapred.JobTracker - Starting tracker
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:507)
at java.net.Socket.connect(Socket.java:457)
at java.net.Socket.<init>(Socket.java:365)
at java.net.Socket.<init>(Socket.java:207)
at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:113)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:359)
at org.apache.hadoop.ipc.Client.call(Client.java:297)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:245)
at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:103)
at org.apache.hadoop.dfs.DistributedFileSystem.<init>(DistributedFileSystem.java:47)
at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:101)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:86)
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:486)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:68)
at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:1200)
2006-11-14 14:00:19,516 INFO util.Credential - Checking Resource aliases
2006-11-14 14:00:19,534 INFO http.HttpServer - Version Jetty/5.1.4
2006-11-14 14:00:20,423 INFO util.Container - Started
org.mortbay.jetty.servlet.WebApplicationHandler@69267649
2006-11-14 14:00:20,492 INFO util.Container - Started
WebApplicationContext[/,/]
2006-11-14 14:00:20,493 INFO util.Container - Started
HttpContext[/logs,/logs]
2006-11-14 14:00:20,493 INFO util.Container - Started
HttpContext[/static,/static]
2006-11-14 14:00:20,496 INFO http.SocketListener - Started
SocketListener on 0.0.0.0:50030
2006-11-14 14:00:20,496 INFO util.Container - Started
org.mortbay.jetty.Server@b6e39f
Once the second crawl gets started
2006-11-14 14:02:03,392 INFO crawl.Injector - Injector: starting
the same issues with the new crawldb, the message queue, and redundant blocks
mentioned above appear again in the namenode log (not shown).
The DFS/RPC problems finally begin to surface in the jobtracker log (and these
bubble up to tool output) several hours into the second crawl:
2006-11-14 17:07:27,539 WARN fs.DFSClient - Problem renewing lease
for DFSClient_-284911018: java.net.SocketTimeoutException: timed out
waiting for rpc response
at org.apache.hadoop.ipc.Client.call(Client.java:312)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
at org.apache.hadoop.dfs.$Proxy0.renewLease(Unknown Source)
at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:440)
at java.lang.Thread.run(Thread.java:595)
Nothing special appears during this same timeframe in the namenode log (just
the continuing message queue and redundant blocks errors).
Finally, there is an indication in the jobtracker log (embedded in the
continuing DFS/RPC problems) that the fetch job has died and is being garbage
collected:
2006-11-14 18:46:37,681 WARN fs.DFSClient - Problem renewing lease
for DFSClient_-284911018: java.net.SocketTimeoutException: timed out
waiting for rpc response
at org.apache.hadoop.ipc.Client.call(Client.java:312)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
at org.apache.hadoop.dfs.$Proxy0.renewLease(Unknown Source)
at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:440)
at java.lang.Thread.run(Thread.java:595)
2006-11-14 18:47:37,339 WARN mapred.JobInProgress - Error cleaning
up job_0005: java.net.SocketTimeoutException: timed out waiting for
rpc response
2006-11-14 18:47:38,696 WARN fs.DFSClient - Problem renewing lease
for DFSClient_-284911018: java.net.SocketTimeoutException: timed out
waiting for rpc response
at org.apache.hadoop.ipc.Client.call(Client.java:312)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
at org.apache.hadoop.dfs.$Proxy0.renewLease(Unknown Source)
at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:440)
at java.lang.Thread.run(Thread.java:595)
About 20 minutes later, we finally see the first of a series of broken pipe
exceptions in the namenode log:
2006-11-14 19:09:54,440 WARN ipc.Server - handler output error
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcher.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
at sun.nio.ch.IOUtil.write(IOUtil.java:60)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:302)
at org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer(SocketChannelOutputStream.java:120)
at org.apache.hadoop.ipc.SocketChannelOutputStream.write(SocketChannelOutputStream.java:93)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
at java.io.DataOutputStream.flush(DataOutputStream.java:106)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:486)
About 30 minutes later, the first of a series of closed channel exceptions
appears in the namenode log (along with an indication that the namenode may
have lost connection to all but one of its datanodes):
2006-11-14 19:39:43,142 WARN ipc.Server - handler output error
java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:125)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:294)
at org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer(SocketChannelOutputStream.java:120)
at org.apache.hadoop.ipc.SocketChannelOutputStream.write(SocketChannelOutputStream.java:93)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
at java.io.DataOutputStream.flush(DataOutputStream.java:106)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:486)
2006-11-14 19:39:43,142 WARN fs.FSNamesystem - Replication
requested of 3 is larger than cluster size (1). Using cluster size.
The namenode process eventually dies, though the jobtracker survives (as do
all of the datanodes and tasktrackers on the slaves).
--
------------------------
Chris Schneider
TransPac Software, Inc.
Schmed@TransPac.com
------------------------
Re: DFS/RPC problems
Posted by Chris Schneider <Sc...@TransPac.com>.
Eric,
>How many files are you creating in HDFS?
>
>I wonder if the messaging package may simply be swamping the namenode with many, many small files? It really is not designed to deal with that use case. How big are your checkpoint files?
Thanks for your excellent suggestion! It appears that this was indeed causing the problem. I will work with Andrzej to come up with a solution to limit the number of files created by his message queue subsystem.
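(For the record, a quick count of what the message queue had piled up in DFS is what confirmed it for me; something along these lines, assuming lsr behaves the way I expect:)

```shell
# Count the files the message queue subsystem has created under /mq.
# Each one costs the namenode an inode plus block metadata in RAM, so
# hundreds of thousands of tiny checkpoint files add up quickly.
bin/hadoop dfs -lsr /mq | wc -l
```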
Best Regards,
- Chris
>On Nov 15, 2006, at 10:52 AM, Chris Schneider wrote:
>
>>Fellow Hadoop coders,
>>
>>I'm getting tragic DFS/RPC problems running Nutch 0.8+ atop Hadoop 0.5.1+ that seem similar to problems others have been reporting (e.g., HADOOP-707). My great hope was that the HADOOP-563 patch would resolve them, and it may well have improved things somewhat. However, I'm still unable to complete the crawl I've been attempting even after rolling in this patch. A few notes about my setup that may be relevant:
>>
>>1) I'm running a crawl through a proxy server, and I've modified Nutch so that it skips the server delay whenever it gets a hit in the proxy cache. Since I've been attempting the same crawl over and over again, the first part of it may run so fast that it taxes the DFS and RPC in a way that exposes fundamental problems.
>>
>>2) I'm injecting 26M URLs into a brand new DB and then trying to fetch them all at once. I've tuned the fetcher.threads.fetch to 500 in order to avoid fatal memory swapping on my cluster of 10 slaves (2 tasktrackers per slave).
>>
>>3) I've been using hadoop job -kill to stop the fetch job when I notice things not going the way I want (not necessarily catastrophic problems). I wonder whether this (in combination with the heavy loading I've applied to hadoop) may result in my DFS becoming "unusable".
>>
>>4) Once my DFS becomes "unusable", I can stop-all.sh/start-all.sh, reboot my systems, etc., but nothing will give me back a cluster I can reliably use to do even relatively simple things like hadoop dfs -du. The start-all.sh script takes a very long time to execute, but hadoop eventually does come up, including its web server. The hadoop fsck / command (unless it dies do to an RPC problem) also reports that my DFS is healthy, though things are very sick indeed. Even simple DFS commands either execute slowly or die with RPC problems (SocketTimeoutException). Crawling at this point may also get started OK, though the fetch will eventually die.
>>
>>5) Part of the modifications we've made to Nutch are a message queue subsystem developed by Andrzej. Since this doesn't use NIO to do any of its communication, I doubt that it is interfering with hadoop's RPC. However, it does a fair amount of writing to the DFS itself, and many of the error messages in the logs are associated with these requests.
>>
>>6) To resolve the "unusable" DFS problem, I'm forced to use the local FS to delete all of the DFS contents (on both master and slaves), and then I tell the namenode to format the DFS. I also wonder whether this operation may leave things in some kind of strange state, though I can't imagine what might be the problem.
>>
>>After resorting to step #6 above yesterday, I performed another test and conducted a fairly detailed analysis of the log files. I injected my 26M URLs into a new crawldb, generated a fetch list (without specifying any topN) and started fetching it. After over an hour of fetching, I noticed that I wanted to do a little more logging on the Nutch side, so I killed the fetch job, did a stop-all.sh, deployed a modified version of Nutch, did a start-all.sh, and tried my crawl again from scratch (i.e., I injected my 26M URLs into a brand new crawldb, etc.) I let this run until the fetch job died, then mined the logs for the following information.
>>
>>Any help our insight you could provide would be greatly appreciated.
>>
>>Best Regards,
>>
>>- Chris
>>
>>-------------------------------------------------------------------------------------------
>>I didn't find anything unusual in the stdout for start-all.sh.
>>
>>The namenode may have had trouble removing files immediately after it was first
>>launched, but this is probably just housekeeping (making sure that it deletes
>>any files that are sometimes left over from previous runs):
>>
>> 2006-11-14 11:13:23,276 WARN dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /tmp/hadoop/mapred/.system.crc because it does not exist
>> 2006-11-14 11:13:23,280 WARN dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /tmp/hadoop/mapred/system because it does not exist
>>
>>Immediately after the injection for the first crawl began
>> 2006-11-14 11:45:45,141 INFO crawl.Injector - Injector: starting
>>there is a significant(?) DFS block issue in the s1 datanode's log:
>>
>> 2006-11-14 11:45:49,021 WARN dfs.DataNode - DataXCeiver
>> java.io.IOException: Block blk_452376043108156205 has already been started (though not completed), and thus cannot be created.
>> at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:298)
>> at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:663)
>> at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:507)
>> at java.lang.Thread.run(Thread.java:595)
>>
>>This problem is reflected in the requesting (s7) datanode's log as well and
>>apparently results in its connection to the namenode getting reset(?):
>>
>> 2006-11-14 11:45:49,010 WARN dfs.DataNode - Failed to transfer blk_452376043108156205 to s1-crawlera/10.10.16.101:50010
>> java.net.SocketException: Connection reset
>> at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
>> at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>> at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>> at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>> at java.io.DataOutputStream.write(DataOutputStream.java:90)
>> at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:912)
>> at java.lang.Thread.run(Thread.java:595)
>>
>>(There are three of these events in s7's datanode log during the injection,
>>generation and fetching, which seem to be correlated with some of the message
>>queue subsystem problems mentioned below.)
>>
>>As the injection for the first crawl completed
>> 2006-11-14 12:06:13,613 INFO crawl.Injector - Injector: done
>>there are a couple of unimportant(?) messages that I can attribute to a brand
>>new crawldb:
>>
>> 2006-11-14 12:06:13,603 WARN dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /user/crawler/crawl-20061114114341/crawldb/.old.crc because it does not exist
>> 2006-11-14 12:06:13,603 WARN dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /user/crawler/crawl-20061114114341/crawldb/old because it does not exist
>> 2006-11-14 12:06:13,604 WARN dfs.StateChange - DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/crawler/crawl-20061114114341/crawldb/current to /user/crawler/crawl-20061114114341/crawldb/old because source does not exist
>> 2006-11-14 12:06:13,607 WARN dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /user/crawler/crawl-20061114114341/crawldb/.old.crc because it does not exist
>> 2006-11-14 12:06:13,608 WARN dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /user/crawler/crawl-20061114114341/crawldb/old because it does not exist
>>
>>Once the fetcher job got going
>> 2006-11-14 12:21:30,654 INFO fetcher.Fetcher - Fetcher: segment: /user/crawler/crawl-20061114114341/segments/20061114120615
>>Andrzej's message queue subsystem apparently had tons of trouble creating
>>various files, with messages getting spit out every minute or so:
>>
>> 2006-11-14 12:21:54,827 WARN dfs.StateChange - DIR* NameSystem.startFile: failed to create file /mq/alerts/fetcher/.1163535714813.s7-crawlera.crc for DFSClient_task_0005_m_000010_0 on client s7-crawlera because pendingCreates is non-null.
>> ...
>> 2006-11-14 13:10:50,457 WARN dfs.StateChange - DIR* NameSystem.startFile: failed to create file /mq/alerts/fetcher/.1163538650385.s7-crawlera.crc for DFSClient_task_0005_m_000019_0 on client s7-crawlera because pendingCreates is non-null.
>>
>>These may be correlated with entries in the datanode logs:
>>
>> 2006-11-14 13:10:46,387 WARN dfs.DataNode - DataXCeiver
>> java.io.IOException: Block blk_-8970705341808748626 is valid, and cannot be written to.
>> at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:285)
>> at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:663)
>> at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:507)
>> at java.lang.Thread.run(Thread.java:595)
>>
>>The namenode also noticed a few redundant addStoredBlock requests:
>>
>> 2006-11-14 13:10:56,103 WARN dfs.StateChange - BLOCK* NameSystem.addStoredBlock: Redundant addStoredBlock request received for blk_-4818305081009377296 on s10-crawlera:50010
>>
>>But I *don't* find anything in the datanode logs mentioning these blocks.
>>
>>Finally (perhaps in response to me killing this first fetch job?), we see our
>>first ClosedChannelException in the namenode log:
>>
>> 2006-11-14 13:40:44,634 WARN ipc.Server - handler output error
>> java.nio.channels.ClosedChannelException
>> at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:125)
>> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:294)
>> at org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer(SocketChannelOutputStream.java:120)
>> at org.apache.hadoop.ipc.SocketChannelOutputStream.write(SocketChannelOutputStream.java:93)
>> at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>> at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:486)
>>
>>This is associated with "unknown opcode" messages in most of the datanode logs:
>>
>> s3-crawlera.prod.krugle.net: 2006-11-14 13:40:43,678 WARN dfs.DataNode - DataXCeiver
>> s3-crawlera.prod.krugle.net: java.io.IOException: Unknown opcode for incoming data stream
>> s3-crawlera.prod.krugle.net: at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:515)
>> s3-crawlera.prod.krugle.net: at java.lang.Thread.run(Thread.java:595)
>>
>>
>>--------------------------------------------------------------------------------
>>Here I did a stop-all.sh, deployed a version of nutch with a bit more logging,
>>and then did a start-all.sh
>>--------------------------------------------------------------------------------
>>
>>
>>The jobtracker log shows that it was apparently unable to connect to the
>>namenode just after it gets launched the second time. The connection error
>>repeats over and over again, but eventually we get a few messages about the
>>web server starting up that may indicate that the jobtracker was finally able
>>to connect to the namenode(?)
>>
>> 2006-11-14 13:59:52,321 WARN mapred.JobTracker - Starting tracker
>> java.net.ConnectException: Connection refused
>> at java.net.PlainSocketImpl.socketConnect(Native Method)
>> at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
>> at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>> at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
>> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>> at java.net.Socket.connect(Socket.java:507)
>> at java.net.Socket.connect(Socket.java:457)
>> at java.net.Socket.<init>(Socket.java:365)
>> at java.net.Socket.<init>(Socket.java:207)
>> at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:113)
>> at org.apache.hadoop.ipc.Client.getConnection(Client.java:359)
>> at org.apache.hadoop.ipc.Client.call(Client.java:297)
>> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
>> at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source)
>> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:245)
>> at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:103)
>> at org.apache.hadoop.dfs.DistributedFileSystem.<init>(DistributedFileSystem.java:47)
>> at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:101)
>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:86)
>> at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:486)
>> at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:68)
>> at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:1200)
>> ...
>> 2006-11-14 14:00:18,352 WARN mapred.JobTracker - Starting tracker
>> java.net.ConnectException: Connection refused
>> at java.net.PlainSocketImpl.socketConnect(Native Method)
>> at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
>> at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>> at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
>> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>> at java.net.Socket.connect(Socket.java:507)
>> at java.net.Socket.connect(Socket.java:457)
>> at java.net.Socket.<init>(Socket.java:365)
>> at java.net.Socket.<init>(Socket.java:207)
>> at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:113)
>> at org.apache.hadoop.ipc.Client.getConnection(Client.java:359)
>> at org.apache.hadoop.ipc.Client.call(Client.java:297)
>> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
>> at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source)
>> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:245)
>> at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:103)
>> at org.apache.hadoop.dfs.DistributedFileSystem.<init>(DistributedFileSystem.java:47)
>> at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:101)
>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:86)
>> at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:486)
>> at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:68)
>> at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:1200)
>> 2006-11-14 14:00:19,516 INFO util.Credential - Checking Resource aliases
>> 2006-11-14 14:00:19,534 INFO http.HttpServer - Version Jetty/5.1.4
>> 2006-11-14 14:00:20,423 INFO util.Container - Started org.mortbay.jetty.servlet.WebApplicationHandler@69267649
>> 2006-11-14 14:00:20,492 INFO util.Container - Started WebApplicationContext[/,/]
>> 2006-11-14 14:00:20,493 INFO util.Container - Started HttpContext[/logs,/logs]
>> 2006-11-14 14:00:20,493 INFO util.Container - Started HttpContext[/static,/static]
>> 2006-11-14 14:00:20,496 INFO http.SocketListener - Started SocketListener on 0.0.0.0:50030
>> 2006-11-14 14:00:20,496 INFO util.Container - Started org.mortbay.jetty.Server@b6e39f
>>
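The repeated "Connection refused" warnings above are what a client looks like while it retries a server that is still coming up. As a generic sketch of that retry pattern (plain Python, not Hadoop's actual client code):

```python
import time

def connect_with_retry(connect, attempts=10, delay=1.0, backoff=2.0):
    """Keep calling a connect function while the server may still be
    starting; give up after a fixed number of attempts. A generic
    sketch of the pattern, not Hadoop's implementation."""
    for i in range(attempts):
        try:
            return connect()
        except ConnectionRefusedError:
            if i == attempts - 1:
                raise
            time.sleep(delay)
            delay *= backoff  # wait a little longer each time

# Simulated server that only accepts on the third attempt:
state = {"calls": 0}
def flaky_connect():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionRefusedError("still starting")
    return "connected"

print(connect_with_retry(flaky_connect, delay=0.001))  # connected
```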
>>Once the second crawl gets started
>> 2006-11-14 14:02:03,392 INFO crawl.Injector - Injector: starting
>>the same issues with the new crawldb, the message queue, and redundant blocks
>>mentioned above appear again in the namenode log (not shown).
>>
>>The DFS/RPC problems finally begin to surface in the jobtracker log (and these
>>bubble up to tool output) several hours into the second crawl:
>>
>> 2006-11-14 17:07:27,539 WARN fs.DFSClient - Problem renewing lease for DFSClient_-284911018: java.net.SocketTimeoutException: timed out waiting for rpc response
>> at org.apache.hadoop.ipc.Client.call(Client.java:312)
>> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
>> at org.apache.hadoop.dfs.$Proxy0.renewLease(Unknown Source)
>> at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:440)
>> at java.lang.Thread.run(Thread.java:595)
>>
>>Nothing special appears during this same timeframe in the namenode log (just
>>the continuing message queue and redundant blocks errors).
>>
>>Finally, there is an indication in the jobtracker log (embedded in the
>>continuing DFS/RPC problems) that the fetch job has died and is being garbage
>>collected:
>>
>> 2006-11-14 18:46:37,681 WARN fs.DFSClient - Problem renewing lease for DFSClient_-284911018: java.net.SocketTimeoutException: timed out waiting for rpc response
>> at org.apache.hadoop.ipc.Client.call(Client.java:312)
>> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
>> at org.apache.hadoop.dfs.$Proxy0.renewLease(Unknown Source)
>> at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:440)
>> at java.lang.Thread.run(Thread.java:595)
>>
>> 2006-11-14 18:47:37,339 WARN mapred.JobInProgress - Error cleaning up job_0005: java.net.SocketTimeoutException: timed out waiting for rpc response
>> 2006-11-14 18:47:38,696 WARN fs.DFSClient - Problem renewing lease for DFSClient_-284911018: java.net.SocketTimeoutException: timed out waiting for rpc response
>> at org.apache.hadoop.ipc.Client.call(Client.java:312)
>> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
>> at org.apache.hadoop.dfs.$Proxy0.renewLease(Unknown Source)
>> at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:440)
>> at java.lang.Thread.run(Thread.java:595)
>>
>>About 20 minutes later, we finally see the first of a series of broken pipe
>>exceptions in the namenode log:
>>
>> 2006-11-14 19:09:54,440 WARN ipc.Server - handler output error
>> java.io.IOException: Broken pipe
>> at sun.nio.ch.FileDispatcher.write0(Native Method)
>> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
>> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
>> at sun.nio.ch.IOUtil.write(IOUtil.java:60)
>> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:302)
>> at org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer(SocketChannelOutputStream.java:120)
>> at org.apache.hadoop.ipc.SocketChannelOutputStream.write(SocketChannelOutputStream.java:93)
>> at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>> at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:486)
>>
>>About 30 minutes later, the first of a series of closed channel exceptions
>>appears in the namenode log (along with an indication that the namenode may
>>have lost connection to all but one of its datanodes):
>>
>> 2006-11-14 19:39:43,142 WARN ipc.Server - handler output error
>> java.nio.channels.ClosedChannelException
>> at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:125)
>> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:294)
>> at org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer(SocketChannelOutputStream.java:120)
>> at org.apache.hadoop.ipc.SocketChannelOutputStream.write(SocketChannelOutputStream.java:93)
>> at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>> at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:486)
>> 2006-11-14 19:39:43,142 WARN fs.FSNamesystem - Replication requested of 3 is larger than cluster size (1). Using cluster size.
>>
>>The namenode process eventually dies, though the jobtracker survives (as do
>>all of the datanodes and tasktrackers on the slaves).
>>--
>>------------------------
>>Chris Schneider
>>TransPac Software, Inc.
>>Schmed@TransPac.com
>>------------------------
--
------------------------
Chris Schneider
TransPac Software, Inc.
Schmed@TransPac.com
------------------------
Re: DFS/RPC problems
Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.
How many files are you creating in HDFS?
I wonder if the messaging package may simply be swamping the namenode
with many, many small files. HDFS really is not designed for that use
case. How big are your checkpoint files?
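To put rough numbers on that worry: the namenode tracks every file, directory, and block in memory, so per-object overhead dominates when files are tiny. A back-of-envelope sketch (the per-object byte count below is an assumed ballpark, not a measured figure for this Hadoop version):

```python
# Rough namenode heap estimate: every file and every block lives in
# namenode memory. bytes_per_object is an assumed ballpark, not a
# measured constant.

def estimated_namenode_bytes(num_files, avg_blocks_per_file=1,
                             bytes_per_object=150):
    num_blocks = num_files * avg_blocks_per_file
    return (num_files + num_blocks) * bytes_per_object

# One tiny checkpoint file per URL for 26M URLs...
many_small = estimated_namenode_bytes(26_000_000)
# ...versus the same data packed into 100 large multi-block files:
few_large = estimated_namenode_bytes(100, avg_blocks_per_file=500)

print(many_small // 2**20, "MB vs", few_large // 2**10, "KB")
```

The point is the ratio, not the absolute numbers: millions of one-block files cost gigabytes of namenode heap that a few large files would not.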
On Nov 15, 2006, at 10:52 AM, Chris Schneider wrote:
> Fellow Hadoop coders,
>
> I'm getting tragic DFS/RPC problems running Nutch 0.8+ atop Hadoop
> 0.5.1+ that seem similar to problems others have been reporting
> (e.g., HADOOP-707). My great hope was that the HADOOP-563 patch
> would resolve them, and it may well have improved things somewhat.
> However, I'm still unable to complete the crawl I've been
> attempting even after rolling in this patch. A few notes about my
> setup that may be relevant:
>
> 1) I'm running a crawl through a proxy server, and I've modified
> Nutch so that it skips the server delay whenever it gets a hit in
> the proxy cache. Since I've been attempting the same crawl over and
> over again, the first part of it may run so fast that it taxes the
> DFS and RPC in a way that exposes fundamental problems.
>
> 2) I'm injecting 26M URLs into a brand new DB and then trying to
> fetch them all at once. I've tuned the fetcher.threads.fetch to 500
> in order to avoid fatal memory swapping on my cluster of 10 slaves
> (2 tasktrackers per slave).
>
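As a sanity check on those settings, the aggregate thread count is easy to estimate. This sketch assumes one running fetch task per tasktracker, and the per-thread stack size is a hypothetical figure (JVM thread stacks are configurable), so treat the result as an order-of-magnitude illustration only:

```python
# Aggregate fetcher-thread arithmetic for the setup described above.
# Assumes one running fetch task per tasktracker; stack_kb is a
# hypothetical per-thread stack size, not a measured value.

threads_per_task = 500          # fetcher.threads.fetch
tasktrackers = 10 * 2           # 10 slaves x 2 tasktrackers each
total_threads = threads_per_task * tasktrackers

stack_kb = 512                  # assumed per-thread stack
print(total_threads, "threads,",
      total_threads * stack_kb // 1024, "MB of thread stacks alone")
```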
> 3) I've been using hadoop job -kill to stop the fetch job when I
> notice things not going the way I want (not necessarily
> catastrophic problems). I wonder whether this (in combination with
> the heavy loading I've applied to hadoop) may result in my DFS
> becoming "unusable".
>
> 4) Once my DFS becomes "unusable", I can stop-all.sh/start-all.sh,
> reboot my systems, etc., but nothing will give me back a cluster I
> can reliably use to do even relatively simple things like hadoop
> dfs -du. The start-all.sh script takes a very long time to execute,
> but hadoop eventually does come up, including its web server. The
> hadoop fsck / command (unless it dies due to an RPC problem) also
> reports that my DFS is healthy, though things are very sick indeed.
> Even simple DFS commands either execute slowly or die with RPC
> problems (SocketTimeoutException). Crawling at this point may also
> get started OK, though the fetch will eventually die.
>
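The SocketTimeoutExceptions described in point 4 are the client-side read deadline firing rather than the call blocking forever. In miniature, with plain sockets instead of Hadoop RPC:

```python
import socket

# A client read with a deadline: the server accepts the connection but
# never answers, so recv() raises a timeout instead of hanging -- the
# plain-socket analogue of the RPC SocketTimeoutException.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.create_connection(srv.getsockname())
cli.settimeout(0.05)            # 50 ms deadline
try:
    cli.recv(1)                 # nothing ever arrives
    timed_out = False
except socket.timeout:
    timed_out = True

print(timed_out)  # True
cli.close()
srv.close()
```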
> 5) Part of the modifications we've made to Nutch are a message
> queue subsystem developed by Andrzej. Since this doesn't use NIO to
> do any of its communication, I doubt that it is interfering with
> hadoop's RPC. However, it does a fair amount of writing to the DFS
> itself, and many of the error messages in the logs are associated
> with these requests.
>
> 6) To resolve the "unusable" DFS problem, I'm forced to use the
> local FS to delete all of the DFS contents (on both master and
> slaves), and then I tell the namenode to format the DFS. I also
> wonder whether this operation may leave things in some kind of
> strange state, though I can't imagine what might be the problem.
>
> After resorting to step #6 above yesterday, I performed another
> test and conducted a fairly detailed analysis of the log files. I
> injected my 26M URLs into a new crawldb, generated a fetch list
> (without specifying any topN) and started fetching it. After over
> an hour of fetching, I noticed that I wanted to do a little more
> logging on the Nutch side, so I killed the fetch job, did a stop-
> all.sh, deployed a modified version of Nutch, did a start-all.sh,
> and tried my crawl again from scratch (i.e., I injected my 26M URLs
> into a brand new crawldb, etc.) I let this run until the fetch job
> died, then mined the logs for the following information.
>
> Any help or insight you could provide would be greatly appreciated.
>
> Best Regards,
>
> - Chris
>
> ----------------------------------------------------------------------
> ---------------------
> I didn't find anything unusual in the stdout for start-all.sh.
>
> The namenode may have had trouble removing files immediately after
> it was first
> launched, but this is probably just housekeeping (making sure that
> it deletes
> any files that are sometimes left over from previous runs):
>
> 2006-11-14 11:13:23,276 WARN dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /tmp/hadoop/mapred/.system.crc because it does not exist
> 2006-11-14 11:13:23,280 WARN dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /tmp/hadoop/mapred/system because it does not exist
>
> Immediately after the injection for the first crawl began
> 2006-11-14 11:45:45,141 INFO crawl.Injector - Injector: starting
> there is a significant(?) DFS block issue in the s1 datanode's log:
>
> 2006-11-14 11:45:49,021 WARN dfs.DataNode - DataXCeiver
> java.io.IOException: Block blk_452376043108156205 has already been started (though not completed), and thus cannot be created.
>     at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:298)
>     at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:663)
>     at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:507)
>     at java.lang.Thread.run(Thread.java:595)
>
> This problem is reflected in the requesting (s7) datanode's log as
> well and
> apparently results in its connection to the namenode getting reset(?):
>
> 2006-11-14 11:45:49,010 WARN dfs.DataNode - Failed to transfer blk_452376043108156205 to s1-crawlera/10.10.16.101:50010
> java.net.SocketException: Connection reset
>     at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
>     at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>     at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>     at java.io.DataOutputStream.write(DataOutputStream.java:90)
>     at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:912)
>     at java.lang.Thread.run(Thread.java:595)
>
> (There are three of these events in s7's datanode log during the
> injection,
> generation and fetching, which seem to be correlated with some of
> the message
> queue subsystem problems mentioned below.)
>
> As the injection for the first crawl completed
> 2006-11-14 12:06:13,613 INFO crawl.Injector - Injector: done
> there are a couple of unimportant(?) messages that I can attribute
> to a brand
> new crawldb:
>
> 2006-11-14 12:06:13,603 WARN dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /user/crawler/crawl-20061114114341/crawldb/.old.crc because it does not exist
> 2006-11-14 12:06:13,603 WARN dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /user/crawler/crawl-20061114114341/crawldb/old because it does not exist
> 2006-11-14 12:06:13,604 WARN dfs.StateChange - DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/crawler/crawl-20061114114341/crawldb/current to /user/crawler/crawl-20061114114341/crawldb/old because source does not exist
> 2006-11-14 12:06:13,607 WARN dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /user/crawler/crawl-20061114114341/crawldb/.old.crc because it does not exist
> 2006-11-14 12:06:13,608 WARN dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /user/crawler/crawl-20061114114341/crawldb/old because it does not exist
>
> Once the fetcher job got going
> 2006-11-14 12:21:30,654 INFO fetcher.Fetcher - Fetcher: segment: /user/crawler/crawl-20061114114341/segments/20061114120615
> Andrzej's message queue subsystem apparently had tons of trouble
> creating
> various files, with messages getting spit out every minute or so:
>
> 2006-11-14 12:21:54,827 WARN dfs.StateChange - DIR* NameSystem.startFile: failed to create file /mq/alerts/fetcher/.1163535714813.s7-crawlera.crc for DFSClient_task_0005_m_000010_0 on client s7-crawlera because pendingCreates is non-null.
> ...
> 2006-11-14 13:10:50,457 WARN dfs.StateChange - DIR* NameSystem.startFile: failed to create file /mq/alerts/fetcher/.1163538650385.s7-crawlera.crc for DFSClient_task_0005_m_000019_0 on client s7-crawlera because pendingCreates is non-null.
>
> These may be correlated with entries in the datanode logs:
>
> 2006-11-14 13:10:46,387 WARN dfs.DataNode - DataXCeiver
> java.io.IOException: Block blk_-8970705341808748626 is valid, and cannot be written to.
>     at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:285)
>     at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:663)
>     at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:507)
>     at java.lang.Thread.run(Thread.java:595)
>
> The namenode also noticed a few redundant addStoredBlock requests:
>
> 2006-11-14 13:10:56,103 WARN dfs.StateChange - BLOCK* NameSystem.addStoredBlock: Redundant addStoredBlock request received for blk_-4818305081009377296 on s10-crawlera:50010
>
> But I *don't* find anything in the datanode logs mentioning these
> blocks.
>
> Finally (perhaps in response to me killing this first fetch job?),
> we see our
> first ClosedChannelException in the namenode log:
>
> 2006-11-14 13:40:44,634 WARN ipc.Server - handler output error
> java.nio.channels.ClosedChannelException
>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:125)
>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:294)
>     at org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer(SocketChannelOutputStream.java:120)
>     at org.apache.hadoop.ipc.SocketChannelOutputStream.write(SocketChannelOutputStream.java:93)
>     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>     at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:486)
>
> This is associated with "unknown opcode" messages in most of the
> datanode logs:
>
> s3-crawlera.prod.krugle.net: 2006-11-14 13:40:43,678 WARN dfs.DataNode - DataXCeiver
> s3-crawlera.prod.krugle.net: java.io.IOException: Unknown opcode for incoming data stream
> s3-crawlera.prod.krugle.net:     at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:515)
> s3-crawlera.prod.krugle.net:     at java.lang.Thread.run(Thread.java:595)
>
>
> ----------------------------------------------------------------------
> ----------
> Here I did a stop-all.sh, deployed a version of nutch with a bit
> more logging,
> and then did a start-all.sh
> ----------------------------------------------------------------------
> ----------
>
>
> The jobtracker log shows that it was apparently unable to connect
> to the
> namenode just after it gets launched the second time. The
> connection error
> repeats over and over again, but eventually we get a few messages
> about the
> web server starting up that may indicate that the jobtracker was
> finally able
> to connect to the namenode(?)
>
> 2006-11-14 13:59:52,321 WARN mapred.JobTracker - Starting tracker
> java.net.ConnectException: Connection refused
>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>     at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
>     at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>     at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>     at java.net.Socket.connect(Socket.java:507)
>     at java.net.Socket.connect(Socket.java:457)
>     at java.net.Socket.<init>(Socket.java:365)
>     at java.net.Socket.<init>(Socket.java:207)
>     at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:113)
>     at org.apache.hadoop.ipc.Client.getConnection(Client.java:359)
>     at org.apache.hadoop.ipc.Client.call(Client.java:297)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
>     at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source)
>     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:245)
>     at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:103)
>     at org.apache.hadoop.dfs.DistributedFileSystem.<init>(DistributedFileSystem.java:47)
>     at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:101)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:86)
>     at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:486)
>     at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:68)
>     at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:1200)
> ...
> 2006-11-14 14:00:18,352 WARN mapred.JobTracker - Starting tracker
> java.net.ConnectException: Connection refused
>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>     at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
>     at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>     at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>     at java.net.Socket.connect(Socket.java:507)
>     at java.net.Socket.connect(Socket.java:457)
>     at java.net.Socket.<init>(Socket.java:365)
>     at java.net.Socket.<init>(Socket.java:207)
>     at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:113)
>     at org.apache.hadoop.ipc.Client.getConnection(Client.java:359)
>     at org.apache.hadoop.ipc.Client.call(Client.java:297)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
>     at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source)
>     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:245)
>     at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:103)
>     at org.apache.hadoop.dfs.DistributedFileSystem.<init>(DistributedFileSystem.java:47)
>     at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:101)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:86)
>     at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:486)
>     at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:68)
>     at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:1200)
> 2006-11-14 14:00:19,516 INFO util.Credential - Checking Resource aliases
> 2006-11-14 14:00:19,534 INFO http.HttpServer - Version Jetty/5.1.4
> 2006-11-14 14:00:20,423 INFO util.Container - Started org.mortbay.jetty.servlet.WebApplicationHandler@69267649
> 2006-11-14 14:00:20,492 INFO util.Container - Started WebApplicationContext[/,/]
> 2006-11-14 14:00:20,493 INFO util.Container - Started HttpContext[/logs,/logs]
> 2006-11-14 14:00:20,493 INFO util.Container - Started HttpContext[/static,/static]
> 2006-11-14 14:00:20,496 INFO http.SocketListener - Started SocketListener on 0.0.0.0:50030
> 2006-11-14 14:00:20,496 INFO util.Container - Started org.mortbay.jetty.Server@b6e39f
>
> Once the second crawl gets started
> 2006-11-14 14:02:03,392 INFO crawl.Injector - Injector: starting
> the same issues with the new crawldb, the message queue, and
> redundant blocks
> mentioned above appear again in the namenode log (not shown).
>
> The DFS/RPC problems finally begin to surface in the jobtracker log
> (and these
> bubble up to tool output) several hours into the second crawl:
>
> 2006-11-14 17:07:27,539 WARN fs.DFSClient - Problem renewing lease for DFSClient_-284911018: java.net.SocketTimeoutException: timed out waiting for rpc response
>     at org.apache.hadoop.ipc.Client.call(Client.java:312)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
>     at org.apache.hadoop.dfs.$Proxy0.renewLease(Unknown Source)
>     at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:440)
>     at java.lang.Thread.run(Thread.java:595)
>
> Nothing special appears during this same timeframe in the namenode
> log (just
> the continuing message queue and redundant blocks errors).
>
> Finally, there is an indication in the jobtracker log (embedded in the
> continuing DFS/RPC problems) that the fetch job has died and is
> being garbage
> collected:
>
> 2006-11-14 18:46:37,681 WARN fs.DFSClient - Problem renewing lease for DFSClient_-284911018: java.net.SocketTimeoutException: timed out waiting for rpc response
>     at org.apache.hadoop.ipc.Client.call(Client.java:312)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
>     at org.apache.hadoop.dfs.$Proxy0.renewLease(Unknown Source)
>     at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:440)
>     at java.lang.Thread.run(Thread.java:595)
>
> 2006-11-14 18:47:37,339 WARN mapred.JobInProgress - Error cleaning up job_0005: java.net.SocketTimeoutException: timed out waiting for rpc response
> 2006-11-14 18:47:38,696 WARN fs.DFSClient - Problem renewing lease for DFSClient_-284911018: java.net.SocketTimeoutException: timed out waiting for rpc response
>     at org.apache.hadoop.ipc.Client.call(Client.java:312)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
>     at org.apache.hadoop.dfs.$Proxy0.renewLease(Unknown Source)
>     at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:440)
>     at java.lang.Thread.run(Thread.java:595)
>
> About 20 minutes later, we finally see the first of a series of
> broken pipe
> exceptions in the namenode log:
>
> 2006-11-14 19:09:54,440 WARN ipc.Server - handler output error
> java.io.IOException: Broken pipe
>     at sun.nio.ch.FileDispatcher.write0(Native Method)
>     at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
>     at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
>     at sun.nio.ch.IOUtil.write(IOUtil.java:60)
>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:302)
>     at org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer(SocketChannelOutputStream.java:120)
>     at org.apache.hadoop.ipc.SocketChannelOutputStream.write(SocketChannelOutputStream.java:93)
>     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>     at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:486)
>
> About 30 minutes later, the first of a series of closed channel
> exceptions
> appears in the namenode log (along with an indication that the
> namenode may
> have lost connection to all but one of its datanodes):
>
> 2006-11-14 19:39:43,142 WARN ipc.Server - handler output error
> java.nio.channels.ClosedChannelException
>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:125)
>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:294)
>     at org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer(SocketChannelOutputStream.java:120)
>     at org.apache.hadoop.ipc.SocketChannelOutputStream.write(SocketChannelOutputStream.java:93)
>     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>     at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:486)
> 2006-11-14 19:39:43,142 WARN fs.FSNamesystem - Replication requested of 3 is larger than cluster size (1). Using cluster size.
>
> The namenode process eventually dies, though the jobtracker
> survives (as do
> all of the datanodes and tasktrackers on the slaves).