Posted to dev@nutch.apache.org by Richard Braman <rb...@bramantax.com> on 2006/03/03 02:28:21 UTC

OutOfMemoryError/Restarting Crawl/Indexing what has already been crawled

I have Nutch running on a Compaq DL380 with 1 GB of RAM. It is not my best
machine, but I am only doing a limited crawl of about 52 URLs. When I do
the crawl with depth = 3 or even 6 it completes; when I run it at depth =
10, it runs out of memory.
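(A first thing to try for the heap-space errors is simply giving the JVM more memory. As a hedged sketch: the stock bin/nutch launcher script of that era sizes the heap with a JAVA_HEAP_MAX variable, but check your own copy of the script to confirm the name and default.)

```shell
# Assumption: your bin/nutch script sets the JVM heap with a line like this;
# raising the -Xmx value gives the fetcher more room before it runs out.
JAVA_HEAP_MAX="-Xmx512m"
echo "$JAVA_HEAP_MAX"   # prints -Xmx512m
```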
 
Two questions:

1. How do I restart the crawl?
I have seen the tutorial, which says:
"

 Recover the pages already fetched and then restart the fetcher. You'll
need to create a file fetcher.done in the segment directory and then run:
updatedb, generate and fetch. Assuming your index is at /index 

% touch /index/segments/2005somesegment/fetcher.done 

% bin/nutch updatedb /index/db/ /index/segments/2005somesegment/

% bin/nutch generate /index/db/ /index/segments/2005somesegment/

% bin/nutch fetch /index/segments/2005somesegment

All the pages that were not crawled will be re-generated for fetch. If
you fetched lots of pages and don't want to re-fetch them, this is the
best way.

", 

but I have more than one segment. Do I only need to do this for the most
recent one, or for all of them?
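(For what it's worth, the tutorial's steps are per-segment, so with several segments the usual approach would be to mark each one done and run updatedb once per segment before generating a fresh fetch list. A minimal sketch of that loop follows; the /index paths are the tutorial's assumption, and a scratch copy of the layout is created here so the loop itself can be exercised anywhere. The real updatedb call is left commented out.)

```shell
# Simulate the segment layout (assumption: real segments live under
# /index/segments; adjust the path to your own layout).
mkdir -p segments/20060301120000 segments/20060302130000

for seg in segments/*/ ; do
  touch "${seg}fetcher.done"               # mark the segment as fully fetched
  # bin/nutch updatedb /index/db/ "$seg"   # then fold each one into the db
done

ls segments/*/fetcher.done
```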

2. How do I index what I have already crawled?
I have seen the indexing section in the tutorial, but when I run bin/nutch
invertlinks (under Cygwin) it gives me an Exception in thread "main"
java.lang.NoClassDefFoundError: invertlinks
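(A likely explanation: invertlinks did not exist as a command before Nutch 0.8, and the bin/nutch script treats an unrecognized command word as a class name, which produces exactly this NoClassDefFoundError. On a 0.7-era install the indexing step uses the index/dedup/merge commands instead. The sketch below is an assumption about that older command set; run bin/nutch with no arguments to see the exact commands and usage for your version.)

```shell
# Assumption: Nutch 0.7-style commands and arguments; verify each against
# the usage output of your own bin/nutch before running.
bin/nutch index /index/segments/2005somesegment   # index one segment
bin/nutch dedup /index/segments                   # delete duplicate pages
bin/nutch merge /index/index /index/segments/*    # merge into one index
```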
 
The fetcher exited with:
 
060302 165825 SEVERE error writing output:java.lang.OutOfMemoryError:
Java heap space
java.lang.OutOfMemoryError: Java heap space
Exception in thread "main" java.lang.RuntimeException: SEVERE error
logged.  Exiting fetcher.
 at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
 at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
 at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:140)

 

Richard Braman
mailto:rbraman@taxcodesoftware.org
561.748.4002 (voice) 

http://www.taxcodesoftware.org
Free Open Source Tax Software

 

entrance point of Nutch search page

Posted by Michael Ji <fj...@yahoo.com>.
hi,

Which JSP file is the entry point for the Nutch search page?

I saw Nutch using

search(Query query, int numHits, String dedupField,
String sortField, boolean reverse) 

to get the search results, but I am not sure which JSP triggers this
function.

Is it in the Tomcat container?
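(For reference, the entry page of the Nutch webapp shipped in this era is search.jsp, deployed in a servlet container such as Tomcat; it obtains a NutchBean and calls the search(...) method quoted above. A rough sketch of what that JSP does, with the names assumed from the 0.7-era API rather than confirmed against it:)

```java
// Sketch only; requires the Nutch jars on the classpath. NutchBean.get,
// Query.parse and the 5-argument search(...) are assumed 0.7-era names.
NutchBean bean = NutchBean.get(application);   // 'application' is the JSP's
                                               // implicit ServletContext
Query query = Query.parse(request.getParameter("query"));
Hits hits = bean.search(query, 10, "site", "date", false);
```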

thanks,

Michael


RE: OutOfMemoryError/Restarting Crawl/Indexing what has already been crawled

Posted by Richard Braman <rb...@bramantax.com>.
I think this may be a bug.

-----Original Message-----
From: Richard Braman [mailto:rbraman@bramantax.com] 
Sent: Thursday, March 02, 2006 8:28 PM
To: nutch-dev@lucene.apache.org
Subject: OutOfMemoryError/Restarting Crawl/Indexing what has already
been crawled

