Posted to user@nutch.apache.org by Edward Whittaker <ed...@ewdw.com> on 2005/12/22 13:12:20 UTC

java.io.IOException: Input/output error with bin/nutch updatedb db seg command

Hi, I have successfully run the nutch tutorial at 
http://lucene.apache.org/nutch/tutorial.html using nutch 0.7.1 and have 
been able to set up the web server etc. using a crawl of a few hundred 
thousand sites with no problems at all.

Having now crawled, in addition, a few million sites (in one go, i.e. 
it's all in one segment), I am running into trouble trying to update 
the webdb with the contents of that segment. I was initially running on 
a RedHat 9 32-bit machine with 6 GB of RAM. Although none of the file 
sizes or RAM use exceeded the 32-bit limits, I then swapped to a 
SuSE 8.1 (for AMD64) machine with 12 GB of RAM, using the 64-bit 
jdk1.5.0_06. Specifically, the failure occurs when I run the following 
command:

~/nutch-0.7.1/bin/nutch updatedb db segments/20051216172239

where ls -l segments/20051216172239/* gives:

segments/20051216172239/content:
total 3918828
-rw-r--r--    1 9   coe      4012718873 Dec 17 18:23 data
-rw-r--r--    1 9   coe        158068 Dec 17 18:23 index

segments/20051216172239/fetcher:
total 148232
-rw-r--r--    1 9   coe      151626004 Dec 17 18:23 data
-rw-r--r--    1 9   coe        158068 Dec 17 18:23 index

segments/20051216172239/fetchlist:
total 106204
-rw-r--r--    1 9   coe      108590211 Dec 16 17:32 data
-rw-r--r--    1 9   coe        157004 Dec 16 17:32 index

segments/20051216172239/parse_data:
total 3873864
-rw-r--r--    1 9   coe      3966675725 Dec 17 18:23 data
-rw-r--r--    1 9   coe        158068 Dec 17 18:23 index

segments/20051216172239/parse_text:
total 1120748
-rw-r--r--    1 9   coe      1147485081 Dec 17 18:23 data
-rw-r--r--    1 9   coe        158068 Dec 17 18:23 index
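
As a quick sanity check on the sizes in this listing, here is a minimal 
sketch that compares three of the data files with the signed (2^31 - 1) 
and unsigned (2^32 - 1) 32-bit limits. The class and method names 
(SizeCheck, classify) are made up for illustration. It only does the 
arithmetic and is not meant to suggest that either limit is what causes 
the error.

```java
public class SizeCheck {
    // Sizes in bytes, copied from the ls -l listing of the segment.
    static final long CONTENT_DATA = 4012718873L;
    static final long PARSE_DATA = 3966675725L;
    static final long PARSE_TEXT_DATA = 1147485081L;

    static final long SIGNED_32_MAX = Integer.MAX_VALUE;  // 2^31 - 1
    static final long UNSIGNED_32_MAX = (1L << 32) - 1;   // 2^32 - 1

    // Says where a size falls relative to the two 32-bit limits.
    static String classify(long size) {
        if (size > UNSIGNED_32_MAX) {
            return "above 4 GiB";
        }
        if (size > SIGNED_32_MAX) {
            return "between 2 GiB and 4 GiB";
        }
        return "below 2 GiB";
    }

    public static void main(String[] args) {
        System.out.println("content/data:    " + classify(CONTENT_DATA));
        System.out.println("parse_data/data: " + classify(PARSE_DATA));
        System.out.println("parse_text/data: " + classify(PARSE_TEXT_DATA));
    }
}
```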

I get the following errors:

run java in /home/9/jdk1.5.0_06
051222 195653 parsing file:/home/9/nutch-0.7.1/conf/nutch-default.xml
051222 195653 parsing file:/home/lkr109/nutch-0.7.1/conf/nutch-site.xml
051222 195654 No FS indicated, using default:local
051222 195654 Updating db
051222 195705 Updating for segments/20051216172239
051222 195705 Processing document 0
051222 195705 Plugins: looking in: /home/9/nutch-0.7.1/plugins
051222 195705 not including: /home/9/nutch-0.7.1/plugins/query-more
051222 195705 parsing: /home/9/nutch-0.7.1/plugins/query-site/plugin.xml
051222 195705 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
051222 195705 parsing: /home/9/nutch-0.7.1/plugins/parse-html/plugin.xml
051222 195705 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
051222 195705 parsing: /home/9/nutch-0.7.1/plugins/parse-text/plugin.xml
051222 195706 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
051222 195706 not including: /home/9/nutch-0.7.1/plugins/parse-ext
051222 195706 not including: /home/9/nutch-0.7.1/plugins/parse-pdf
051222 195706 not including: /home/9/nutch-0.7.1/plugins/parse-rss
051222 195706 parsing: /home/9/nutch-0.7.1/plugins/query-basic/plugin.xml
051222 195706 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
051222 195706 not including: /home/9/nutch-0.7.1/plugins/index-more
051222 195706 not including: /home/9/nutch-0.7.1/plugins/parse-js
051222 195706 parsing: /home/9/nutch-0.7.1/plugins/urlfilter-regex/plugin.xml
051222 195706 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
051222 195706 not including: /home/9/nutch-0.7.1/plugins/protocol-ftp
051222 195706 not including: /home/9/nutch-0.7.1/plugins/parse-msword
051222 195706 not including: /home/9/nutch-0.7.1/plugins/creativecommons
051222 195706 not including: /home/9/nutch-0.7.1/plugins/ontology
051222 195706 parsing: /home/9/nutch-0.7.1/plugins/nutch-extensionpoints/plugin.xml
051222 195706 not including: /home/9/nutch-0.7.1/plugins/protocol-file
051222 195706 parsing: /home/9/nutch-0.7.1/plugins/protocol-http/plugin.xml
051222 195706 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
051222 195706 not including: /home/9/nutch-0.7.1/plugins/clustering-carrot2
051222 195706 not including: /home/9/nutch-0.7.1/plugins/language-identifier
051222 195706 not including: /home/9/nutch-0.7.1/plugins/urlfilter-prefix
051222 195706 parsing: /home/9/nutch-0.7.1/plugins/query-url/plugin.xml
051222 195706 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
051222 195706 parsing: /home/9/nutch-0.7.1/plugins/index-basic/plugin.xml
051222 195706 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
051222 195706 not including: /home/9/nutch-0.7.1/plugins/protocol-httpclient
051222 195706 found resource regex-urlfilter.txt at file:/home/9/nutch-0.7.1/conf/regex-urlfilter.txt
051222 195706 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
051222 195710 Processing document 1000
051222 195712 Processing document 2000
051222 195714 Processing document 3000
051222 195716 Processing document 4000
...
...
051222 202623 Processing document 777000
051222 202625 Processing document 778000
051222 202625 Finishing update
Exception in thread "main" java.io.IOException: Input/output error
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:194)
        at org.apache.nutch.fs.LocalFileSystem$LocalNFSFileInputStream.read(LocalFileSystem.java:83)
        at org.apache.nutch.fs.NFSDataInputStream$PositionCache.read(NFSDataInputStream.java:37)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
        at java.io.DataInputStream.readFully(DataInputStream.java:176)
        at org.apache.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
        at org.apache.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
        at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:309)
        at org.apache.nutch.io.SequenceFile$Sorter$MergeStream.next(SequenceFile.java:725)
        at org.apache.nutch.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:755)
        at org.apache.nutch.io.SequenceFile$Sorter$MergePass.run(SequenceFile.java:654)
        at org.apache.nutch.io.SequenceFile$Sorter.mergePass(SequenceFile.java:591)
        at org.apache.nutch.io.SequenceFile$Sorter.sort(SequenceFile.java:419)
        at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:535)
        at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
        at org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
        at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
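
The exception is thrown from the native read inside FileInputStream, so 
one way to narrow this down might be to read the affected files end to 
end outside Nutch and see whether the same error appears. The following 
is a minimal sketch of such a probe, using only java.io. The names 
ReadProbe and readAll are hypothetical and are not part of Nutch, and 
a clean run would not prove that the files are fine.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadProbe {
    // Reads the whole file sequentially and returns the number of bytes read.
    // On failure, rethrows with the approximate offset at which the read failed.
    static long readAll(String path) throws IOException {
        byte[] buf = new byte[64 * 1024];
        long offset = 0;
        InputStream in = new FileInputStream(path);
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                offset += n;
            }
            return offset;
        } catch (IOException e) {
            IOException wrapped = new IOException(
                "read failed near byte offset " + offset + " of " + path);
            wrapped.initCause(e);  // keep the original exception as the cause
            throw wrapped;
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Each argument is a file to probe, e.g. one of the files under db/webdb.new/tmp.
        for (String path : args) {
            System.out.println(path + ": " + readAll(path) + " bytes read");
        }
    }
}
```

If the reported byte count matches the size shown by ls -l, the file was 
readable from start to end on that run.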



After this failure the contents of the db directory are as follows (ls 
-ltr db/*/*):

-rw-r--r--    1 9   coe            17 Dec 20 20:03 db/webdb/stats

db/webdb/linksByMD5:
total 686364
-rw-r--r--    1 9   coe      697121985 Dec 20 19:59 data
-rw-r--r--    1 9   coe       5713088 Dec 20 19:59 index

db/webdb/linksByURL:
total 686364
-rw-r--r--    1 9   coe      697121985 Dec 20 20:03 data
-rw-r--r--    1 9   coe       5710026 Dec 20 20:03 index

db/webdb/pagesByMD5:
total 355200
-rw-r--r--    1 9   coe       3081749 Dec 20 20:05 index
-rw-r--r--    1 9   coe      360637849 Dec 20 20:05 data

db/webdb/pagesByURL:
total 505324
-rw-r--r--    1 9   coe       1805388 Dec 20 20:08 index
-rw-r--r--    1 9   coe      515641937 Dec 20 20:08 data

db/webdb.new/tmp:
total 5344972
-rw-r--r--    1 9   coe            75 Dec 22 19:57 pagesByMD5.out
-rw-r--r--    1 9   coe            75 Dec 22 19:57 linksByURL.out
-rw-r--r--    1 9   coe      774132072 Dec 22 20:26 linksByMD5.out
-rw-r--r--    1 9   coe      4008686084 Dec 22 20:26 pagesByURL.out
-rw-r--r--    1 9   coe      690415479 Dec 22 20:42 pagesByURL.out.sorted

So it seems that something went wrong while the two sorted streams 
(pagesByURL.out.sorted.0 and pagesByURL.out.sorted.1) were being merged 
into pagesByURL.out.sorted. A minute or so before the process died, 
those two files looked as follows:

-rw-r--r--    1 9   coe      4008697873 Dec 22 20:33 pagesByURL.out.sorted.0
-rw-r--r--    1 9   coe      3831078912 Dec 22 20:41 pagesByURL.out.sorted.1
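
To make clear what is meant by merging two sorted streams, here is a 
minimal sketch of a two-way merge, assuming the keys are plain strings 
held in memory. This is not the Nutch SequenceFile.Sorter code, which 
works on files; the names TwoWayMerge and merge are hypothetical. The 
point is only that a merge reads both inputs sequentially from start to 
end, so a read error in either input would stop it part-way through.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class TwoWayMerge {
    // Merges two already-sorted inputs into one sorted list.
    // Assumes neither input contains null elements.
    static List<String> merge(Iterator<String> a, Iterator<String> b) {
        List<String> out = new ArrayList<String>();
        String x = a.hasNext() ? a.next() : null;
        String y = b.hasNext() ? b.next() : null;
        while (x != null || y != null) {
            // Take from a when b is exhausted, or when a's head sorts first.
            if (y == null || (x != null && x.compareTo(y) <= 0)) {
                out.add(x);
                x = a.hasNext() ? a.next() : null;
            } else {
                out.add(y);
                y = b.hasNext() ? b.next() : null;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("a", "c", "e");
        List<String> right = Arrays.asList("b", "d");
        System.out.println(merge(left.iterator(), right.iterator()));
    }
}
```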

To be honest, the above output has not been 100% repeatable: I have 
got it every time except once. On that occasion processing got past 
pagesByURL but then died while processing linksByMD5. I am not 
particularly au fait with Java, so any explicit help would be much 
appreciated.

Thanks, Ed