Posted to user@nutch.apache.org by se...@enhancededge.com on 2006/05/24 19:23:21 UTC

Problems fetching a high number of sites

Hello,

I'm currently having a problem fetching a high number of sites using Nutch
0.7.1. The only configuration change made in nutch-site.xml was
fetcher.threads.fetch = 40; the rest is default. The following is the error
output when attempting to fetch 10 million pages:

060524 094711 SEVERE error writing output:java.lang.OutOfMemoryError
java.lang.OutOfMemoryError
060524 094711 SEVERE error writing output:java.io.IOException: key out of order: 6430941 after 6430941
java.io.IOException: key out of order: 6430941 after 6430941
        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:280)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
060524 094722 SEVERE error writing output:java.io.IOException: key out of order: 6430941 after 6430941
java.io.IOException: key out of order: 6430941 after 6430941
        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:280)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)

As you can see, it stops fetching at about 6430941 sites. Before this, the
process got slower and slower, starting when it hit about 3 million.

From the error type, it looks like we are dealing with a memory problem. The
Nutch process never comes close to using up all my free system memory, only
about 25%. My question is: would this be corrected by allotting more memory to
the Nutch fetcher process (the java command), and if so, how would this be
done? Or is there something that needs to be corrected in the configuration
files?
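
For reference on the memory question: a JVM's heap is capped by its own
maximum heap setting (the standard -Xmx launcher option or a built-in default),
not by how much system memory is free, so an OutOfMemoryError at 25% system
usage is at least plausible. A minimal sketch for checking what ceiling a
given java command actually gets is below. The class name HeapReport and the
helper toMiB are hypothetical and not part of Nutch; how the Nutch launch
script passes options to java is not shown here.

```java
// Hypothetical helper, not part of Nutch: prints the heap limits of the
// JVM it runs in, so the effect of a launcher option like -Xmx can be seen.
public class HeapReport {

    /** Converts a byte count to whole mebibytes (rounded down). */
    public static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory(): the most heap this JVM will attempt to use.
        System.out.println("max heap:   " + toMiB(rt.maxMemory()) + " MiB");
        // totalMemory(): heap currently reserved by the JVM.
        System.out.println("total heap: " + toMiB(rt.totalMemory()) + " MiB");
        // freeMemory(): unused portion of the currently reserved heap.
        System.out.println("free heap:  " + toMiB(rt.freeMemory()) + " MiB");
    }
}
```

Running it as "java HeapReport" and then as "java -Xmx1024m HeapReport" should
show the reported maximum change with the option, which would indicate whether
the default ceiling is far below the available system memory.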

Thanks,

Sean Dean
