Posted to user@nutch.apache.org by Edward Whittaker <ed...@ewdw.com> on 2005/12/22 13:12:20 UTC
java.io.IOException: Input/output error with bin/nutch updatedb db seg command
Hi, I have successfully run the nutch tutorial at
http://lucene.apache.org/nutch/tutorial.html using nutch 0.7.1 and have
been able to set up the web server etc. using a crawl of a few hundred
thousand sites with no problems at all.
Having now crawled a few million more sites (in one go, i.e. it is all
in one segment), I am running into trouble updating the webdb with the
contents of that segment. I was initially running on a RedHat 9 32-bit
machine with 6 GB of RAM. Although none of the file sizes or RAM use
exceeded the 32-bit limits, I then swapped to a SuSE 8.1 (for AMD64)
machine with 12 GB of RAM, using the 64-bit jdk1.5.0_06. The problem
occurs specifically when I run the following command:
~/nutch-0.7.1/bin/nutch updatedb db segments/20051216172239
where ls -l segments/20051216172239/* gives:
segments/20051216172239/content:
total 3918828
-rw-r--r-- 1 9 coe 4012718873 Dec 17 18:23 data
-rw-r--r-- 1 9 coe 158068 Dec 17 18:23 index
segments/20051216172239/fetcher:
total 148232
-rw-r--r-- 1 9 coe 151626004 Dec 17 18:23 data
-rw-r--r-- 1 9 coe 158068 Dec 17 18:23 index
segments/20051216172239/fetchlist:
total 106204
-rw-r--r-- 1 9 coe 108590211 Dec 16 17:32 data
-rw-r--r-- 1 9 coe 157004 Dec 16 17:32 index
segments/20051216172239/parse_data:
total 3873864
-rw-r--r-- 1 9 coe 3966675725 Dec 17 18:23 data
-rw-r--r-- 1 9 coe 158068 Dec 17 18:23 index
segments/20051216172239/parse_text:
total 1120748
-rw-r--r-- 1 9 coe 1147485081 Dec 17 18:23 data
-rw-r--r-- 1 9 coe 158068 Dec 17 18:23 index
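For reference, the data file sizes in this listing can be compared with the two usual 32-bit boundaries. The sketch below is only an arithmetic check: the class name SegmentSizeCheck and its helper methods are made up for illustration, and whether either boundary matters for a given file system or JVM is not something the listing itself establishes.

```java
// Hypothetical helper, not part of Nutch: compares the data file sizes
// from the listing with the signed and unsigned 32-bit limits.
class SegmentSizeCheck {
    // Largest value of a signed 32-bit integer (2^31 - 1).
    static final long SIGNED_32_MAX = Integer.MAX_VALUE;
    // Largest value of an unsigned 32-bit integer (2^32 - 1).
    static final long UNSIGNED_32_MAX = 0xFFFFFFFFL;

    static boolean exceedsSigned32(long bytes) {
        return bytes > SIGNED_32_MAX;
    }

    static boolean exceedsUnsigned32(long bytes) {
        return bytes > UNSIGNED_32_MAX;
    }

    public static void main(String[] args) {
        // Names and sizes copied from the ls -l output.
        String[] names = {
            "content/data", "fetcher/data", "fetchlist/data",
            "parse_data/data", "parse_text/data"
        };
        long[] sizes = {
            4012718873L, 151626004L, 108590211L,
            3966675725L, 1147485081L
        };
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + ": " + sizes[i] + " bytes"
                + ", above 2^31-1: " + exceedsSigned32(sizes[i])
                + ", above 2^32-1: " + exceedsUnsigned32(sizes[i]));
        }
    }
}
```

By this check the content and parse_data data files are below the unsigned limit (about 4 GiB) but above the signed one (about 2 GiB); the other three are below both.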
I get the following errors:
run java in /home/9/jdk1.5.0_06
051222 195653 parsing file:/home/9/nutch-0.7.1/conf/nutch-default.xml
051222 195653 parsing file:/home/lkr109/nutch-0.7.1/conf/nutch-site.xml
051222 195654 No FS indicated, using default:local
051222 195654 Updating db
051222 195705 Updating for segments/20051216172239
051222 195705 Processing document 0
051222 195705 Plugins: looking in: /home/9/nutch-0.7.1/plugins
051222 195705 not including: /home/9/nutch-0.7.1/plugins/query-more
051222 195705 parsing: /home/9/nutch-0.7.1/plugins/query-site/plugin.xml
051222 195705 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
051222 195705 parsing: /home/9/nutch-0.7.1/plugins/parse-html/plugin.xml
051222 195705 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
051222 195705 parsing: /home/9/nutch-0.7.1/plugins/parse-text/plugin.xml
051222 195706 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
051222 195706 not including: /home/9/nutch-0.7.1/plugins/parse-ext
051222 195706 not including: /home/9/nutch-0.7.1/plugins/parse-pdf
051222 195706 not including: /home/9/nutch-0.7.1/plugins/parse-rss
051222 195706 parsing: /home/9/nutch-0.7.1/plugins/query-basic/plugin.xml
051222 195706 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
051222 195706 not including: /home/9/nutch-0.7.1/plugins/index-more
051222 195706 not including: /home/9/nutch-0.7.1/plugins/parse-js
051222 195706 parsing: /home/9/nutch-0.7.1/plugins/urlfilter-regex/plugin.xml
051222 195706 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
051222 195706 not including: /home/9/nutch-0.7.1/plugins/protocol-ftp
051222 195706 not including: /home/9/nutch-0.7.1/plugins/parse-msword
051222 195706 not including: /home/9/nutch-0.7.1/plugins/creativecommons
051222 195706 not including: /home/9/nutch-0.7.1/plugins/ontology
051222 195706 parsing: /home/9/nutch-0.7.1/plugins/nutch-extensionpoints/plugin.xml
051222 195706 not including: /home/9/nutch-0.7.1/plugins/protocol-file
051222 195706 parsing: /home/9/nutch-0.7.1/plugins/protocol-http/plugin.xml
051222 195706 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
051222 195706 not including: /home/9/nutch-0.7.1/plugins/clustering-carrot2
051222 195706 not including: /home/9/nutch-0.7.1/plugins/language-identifier
051222 195706 not including: /home/9/nutch-0.7.1/plugins/urlfilter-prefix
051222 195706 parsing: /home/9/nutch-0.7.1/plugins/query-url/plugin.xml
051222 195706 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
051222 195706 parsing: /home/9/nutch-0.7.1/plugins/index-basic/plugin.xml
051222 195706 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
051222 195706 not including: /home/9/nutch-0.7.1/plugins/protocol-httpclient
051222 195706 found resource regex-urlfilter.txt at file:/home/9/nutch-0.7.1/conf/regex-urlfilter.txt
051222 195706 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
051222 195710 Processing document 1000
051222 195712 Processing document 2000
051222 195714 Processing document 3000
051222 195716 Processing document 4000
...
...
051222 202623 Processing document 777000
051222 202625 Processing document 778000
051222 202625 Finishing update
Exception in thread "main" java.io.IOException: Input/output error
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:194)
at org.apache.nutch.fs.LocalFileSystem$LocalNFSFileInputStream.read(LocalFileSystem.java:83)
at org.apache.nutch.fs.NFSDataInputStream$PositionCache.read(NFSDataInputStream.java:37)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
at java.io.DataInputStream.readFully(DataInputStream.java:176)
at org.apache.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
at org.apache.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:309)
at org.apache.nutch.io.SequenceFile$Sorter$MergeStream.next(SequenceFile.java:725)
at org.apache.nutch.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:755)
at org.apache.nutch.io.SequenceFile$Sorter$MergePass.run(SequenceFile.java:654)
at org.apache.nutch.io.SequenceFile$Sorter.mergePass(SequenceFile.java:591)
at org.apache.nutch.io.SequenceFile$Sorter.sort(SequenceFile.java:419)
at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:535)
at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
at org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
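The innermost frames of this trace are plain java.io reads (FileInputStream.readBytes is a native call), so one way to narrow things down might be to read a suspect file from start to end with no Nutch code involved. The following is a minimal sketch under that assumption; the class name ReadThrough and the method readAll are made up for illustration, and it uses only JDK 1.5-era syntax. It shows whether a sequential read fails and at what offset, not why.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical diagnostic, not part of Nutch: reads each file given on the
// command line sequentially and reports how many bytes could be read.
class ReadThrough {
    // Reads the whole file and returns the number of bytes read.
    // If a read fails, the thrown exception names the offset reached.
    static long readAll(String path) throws IOException {
        byte[] buf = new byte[1 << 20]; // 1 MiB read buffer
        long total = 0;
        InputStream in = new FileInputStream(path);
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        } catch (IOException e) {
            IOException wrapped = new IOException(
                "read of " + path + " failed after " + total + " bytes");
            wrapped.initCause(e);
            throw wrapped;
        } finally {
            in.close();
        }
        return total;
    }

    public static void main(String[] args) {
        for (int i = 0; i < args.length; i++) {
            try {
                System.out.println(args[i] + ": read " + readAll(args[i]) + " bytes");
            } catch (IOException e) {
                System.out.println(e.getMessage() + " (" + e.getCause() + ")");
            }
        }
    }
}
```

It could be run against the segment data files or the temporary files under db/webdb.new/tmp, for example: java ReadThrough db/webdb.new/tmp/pagesByURL.out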
After this failure the contents of the db directory are as follows (ls
-ltr db/*/*):
-rw-r--r-- 1 9 coe 17 Dec 20 20:03 db/webdb/stats
db/webdb/linksByMD5:
total 686364
-rw-r--r-- 1 9 coe 697121985 Dec 20 19:59 data
-rw-r--r-- 1 9 coe 5713088 Dec 20 19:59 index
db/webdb/linksByURL:
total 686364
-rw-r--r-- 1 9 coe 697121985 Dec 20 20:03 data
-rw-r--r-- 1 9 coe 5710026 Dec 20 20:03 index
db/webdb/pagesByMD5:
total 355200
-rw-r--r-- 1 9 coe 3081749 Dec 20 20:05 index
-rw-r--r-- 1 9 coe 360637849 Dec 20 20:05 data
db/webdb/pagesByURL:
total 505324
-rw-r--r-- 1 9 coe 1805388 Dec 20 20:08 index
-rw-r--r-- 1 9 coe 515641937 Dec 20 20:08 data
db/webdb.new/tmp:
total 5344972
-rw-r--r-- 1 9 coe 75 Dec 22 19:57 pagesByMD5.out
-rw-r--r-- 1 9 coe 75 Dec 22 19:57 linksByURL.out
-rw-r--r-- 1 9 coe 774132072 Dec 22 20:26 linksByMD5.out
-rw-r--r-- 1 9 coe 4008686084 Dec 22 20:26 pagesByURL.out
-rw-r--r-- 1 9 coe 690415479 Dec 22 20:42 pagesByURL.out.sorted
So it seems that something went wrong while the two sorted streams
(pagesByURL.out.sorted.0 and pagesByURL.out.sorted.1) were being merged
into pagesByURL.out.sorted. A minute or so before the process died,
those two files looked as follows:
-rw-r--r-- 1 9 coe 4008697873 Dec 22 20:33 pagesByURL.out.sorted.0
-rw-r--r-- 1 9 coe 3831078912 Dec 22 20:41 pagesByURL.out.sorted.1
To be honest, the above output has not been 100% repeatable: I have got
it every time except once. On that occasion processing got past
pagesByURL and instead died while processing linksByMD5. I am not
particularly au fait with Java, so any explicit help would be much
appreciated.
Thanks, Ed