Posted to dev@nutch.apache.org by Richard Braman <rb...@bramantax.com> on 2006/03/03 02:28:21 UTC
OutOfMemoryError/Restarting Crawl/Indexing what has already been crawled
I have Nutch running on a Compaq DL 380 w/ 1 GB of RAM, not my best
machine, but I am only doing a limited crawl of about 52 URLs. When I
do the crawl with depth = 3 or even 6, it completes; when I do it at
depth = 10, it runs out of memory.
Two questions:
1. How do I restart the crawl?
I have seen the tutorial, which says:
"
Recover the pages already fetched and then restart the fetcher. You'll
need to create a file fetcher.done in the segment directory and then
run updatedb, generate, and fetch. Assuming your index is at /index:
% touch /index/segments/2005somesegment/fetcher.done
% bin/nutch updatedb /index/db/ /index/segments/2005somesegment/
% bin/nutch generate /index/db/ /index/segments/2005somesegment/
% bin/nutch fetch /index/segments/2005somesegment
All the pages that were not crawled will be re-generated for fetch. If
you fetched lots of pages and don't want to re-fetch them, this is the
best way.
",
but I have more than one segment. Do I only need to do this for the
most recent one, or for all of them?
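To make the quoted recovery procedure concrete for a multi-segment layout, here is a sketch. It assumes that only the segment whose fetch was interrupted needs the fetcher.done treatment, and that segment directories are named by timestamp, so a plain lexical sort orders them chronologically; the segment names below are made up for illustration.

```shell
# Hypothetical segment names; timestamp-style directory names sort
# chronologically, so `tail -1` picks the newest (interrupted) one.
segments="20050101120000 20050201130000 20060302165825"
latest=$(printf '%s\n' $segments | sort | tail -1)
echo "recovering segment $latest"
# Recovery steps from the tutorial, applied to that one segment:
# touch /index/segments/$latest/fetcher.done
# bin/nutch updatedb /index/db/ /index/segments/$latest/
# bin/nutch generate /index/db/ /index/segments/$latest/
# bin/nutch fetch /index/segments/$latest
```

Segments that already finished fetching should not need re-fetching under this assumption.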
2. How do I index what I have already crawled?
I have seen the indexing section in the tutorial, but when I run
bin/nutch invertlinks it gives me:
Exception in thread "main" java.lang.NoClassDefFoundError: invertlinks
(I am using Cygwin.)
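One guess at what the invertlinks error means: the bin/nutch launcher maps known command names to Java classes and passes anything unrecognized straight to java as a class name, so "NoClassDefFoundError: invertlinks" would mean the command simply is not in the script of the installed release (i.e., the tutorial may describe a newer version). The toy sketch below imitates that dispatch logic; only the class name for crawl is confirmed by the stack trace later in this message, the rest is an assumption.

```shell
# Toy model of bin/nutch command dispatch (an assumption about the
# script's behavior): unknown commands are treated as literal class
# names, which the JVM then fails to load.
dispatch() {
  case "$1" in
    crawl) echo "org.apache.nutch.tools.CrawlTool" ;;
    *)     echo "$1" ;;   # passed to java as-is -> NoClassDefFoundError
  esac
}
dispatch crawl        # prints org.apache.nutch.tools.CrawlTool
dispatch invertlinks  # prints invertlinks
```

Running `bin/nutch` with no arguments usually lists the commands that particular release actually supports.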
The fetcher exited with:
060302 165825 SEVERE error writing output: java.lang.OutOfMemoryError:
Java heap space
java.lang.OutOfMemoryError: Java heap space
Exception in thread "main" java.lang.RuntimeException: SEVERE error
logged. Exiting fetcher.
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
        at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:140)
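For reference, a possible workaround for the OutOfMemoryError is to give the JVM a larger heap before re-running the crawl. This is only a sketch: it assumes the 1 GB machine described above, and that the bin/nutch launcher script honors a NUTCH_HEAPSIZE environment variable (value in MB), which should be verified against the installed copy of the script.

```shell
# Sketch: size the heap at ~75% of physical RAM (assumed 1 GB here),
# leaving the remainder for the OS. NUTCH_HEAPSIZE (in MB) is read by
# the bin/nutch launcher in some releases -- check your script.
total_mb=1024
heap_mb=$(( total_mb * 3 / 4 ))
export NUTCH_HEAPSIZE=$heap_mb
echo "NUTCH_HEAPSIZE=$NUTCH_HEAPSIZE"
# then re-run the crawl, e.g.:
# bin/nutch crawl urls -depth 10
```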
Richard Braman
mailto:rbraman@taxcodesoftware.org
561.748.4002 (voice)
http://www.taxcodesoftware.org
Free Open Source Tax Software
Entry point of Nutch search page
Posted by Michael Ji <fj...@yahoo.com>.
hi,
Which JSP file is the entry point for the Nutch search page?
I saw Nutch using
search(Query query, int numHits, String dedupField,
String sortField, boolean reverse)
to get the search results, but I'm not sure which JSP triggers this
function. Is it in the Tomcat container?
thanks,
Michael,
RE: OutOfMemoryError/Restarting Crawl/Indexing what has already been crawled
Posted by Richard Braman <rb...@bramantax.com>.
I think this may be a bug.
-----Original Message-----
From: Richard Braman [mailto:rbraman@bramantax.com]
Sent: Thursday, March 02, 2006 8:28 PM
To: nutch-dev@lucene.apache.org
Subject: OutOfMemoryError/Restarting Crawl/Indexing what has already
been crawled