Posted to user@nutch.apache.org by Mark Round <ma...@ahc.uk.com> on 2009/08/20 12:22:50 UTC

Possible memory leak in Nutch-1.0 ?

Hi all,

I am experiencing serious out-of-memory errors when querying Nutch, and
would appreciate any pointers or advice. I have a Nutch index that I'm
searching using a simple servlet. This servlet queries the index and
returns the results as XML, so other systems in my network can make use
of the index as a web service. 

In a nutshell, the problem seems to be that with each successive query
to this servlet, Tenured Gen usage grows until I run out of heap space.

I am running Nutch-1.0, with the NUTCH-738 and NUTCH-746 patches applied
(more about that below), Tomcat 6.0.20 and Sun's JVM, 1.6.0_12-b04 on
Debian Lenny 32-bit. I have also tested with OpenJDK, and got the same
results.

My servlet just does the following:

// Load the default Nutch configuration, then add the per-site overrides
Configuration nutchConf = NutchConfiguration.create();
Path configPath = new Path(NUTCH_DIR + "/conf/" + site + "/nutch-site.xml");
nutchConf.addResource(configPath);

// Create the search bean, parse the query string and run the search
NutchBean nutchBean = new NutchBean(nutchConf);
Query nutchQuery = Query.parse(nutchSearchString, nutchConf);
Hits nutchHits = nutchBean.search(nutchQuery, maxResults);
...
... Format the results as XML and output them
...
nutchBean.close();
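
One thing the snippet above does not guard against is an exception
thrown between search() and close(), in which case close() is never
reached. Below is a minimal sketch of the try/finally shape that covers
that path. FakeBean and CloseOnEveryPath are made-up stand-ins (not the
real NutchBean API) so that the example runs without Nutch on the
classpath. Whether this has any bearing on the heap growth described
here is not established.

```java
import java.io.Closeable;
import java.io.IOException;

class CloseOnEveryPath {

    /** Hypothetical stand-in for a searcher that holds resources until closed. */
    static class FakeBean implements Closeable {
        boolean closed = false;

        String search(String query) throws IOException {
            if (query.isEmpty()) {
                throw new IOException("empty query");
            }
            return "<hits query=\"" + query + "\"/>";
        }

        public void close() {
            closed = true;
        }
    }

    /** Runs one query and closes the bean whether or not the search throws. */
    static String handle(FakeBean bean, String query) throws IOException {
        try {
            return bean.search(query);
        } finally {
            bean.close(); // reached on both the normal and the exceptional path
        }
    }

    public static void main(String[] args) throws IOException {
        FakeBean bean = new FakeBean();
        System.out.println(handle(bean, "nutch") + " closed=" + bean.closed);
    }
}
```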

After querying it a few hundred times, my Tenured Gen is up to 50 MB;
after a few thousand requests, I end up with over 500 MB used. I can of
course increase my heap size, but no matter what I set it to, eventually
it all gets consumed and the only option is to restart Tomcat.
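
To track this growth between requests without taking a full heap dump,
the per-pool figures can be read from inside the JVM. Below is a minimal
sketch using the standard java.lang.management API. The class name
HeapPoolProbe is made up for illustration. The old-generation pool name
depends on the collector in use (for example "Tenured Gen" with the
serial collector, "PS Old Gen" with the parallel one), so the sketch
lists every heap pool rather than assuming one name.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

class HeapPoolProbe {

    /** Returns one line per heap memory pool: pool name and megabytes currently used. */
    static List<String> heapPoolUsage() {
        List<String> lines = new ArrayList<String>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as the code cache
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            long usedMb = usage.getUsed() / (1024 * 1024);
            lines.add(pool.getName() + " used=" + usedMb + " MB");
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : heapPoolUsage()) {
            System.out.println(line);
        }
    }
}
```

Calling heapPoolUsage() from the servlet every N requests and logging
the result would show whether the old-generation figure only ever goes
up, or drops back after a full collection.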

I have obtained a heap dump and run it through jhat, but to be honest
I'm not really sure what I'm looking for. I've made the dump available
at http://www.markround.com/static/tomcat.hprof, in case that helps
anyone investigate further.

For what it's worth, I didn't seem to get this issue with Nutch-0.9. 

Regarding the two patches I have applied: I had to use them because
without them I get a lot of threads in the TIMED_WAITING state, which
according to Lambda Probe are stuck here:

java.lang.Thread.sleep ( native code )
org.apache.nutch.searcher.FetchedSegments$SegmentUpdater.run ( FetchedSegments.java:115 )

With the two patches applied, I still get lots of these "stuck" threads,
but they do seem to eventually get cleaned up. I wonder if this could
have anything to do with the problem?
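
To put a number on these threads over time, they can be counted from
inside the JVM. Below is a minimal sketch using
Thread.getAllStackTraces(). The helper name StuckThreadCounter is made
up for illustration, and the class name passed in main() is the one
from the stack trace above. The thread state and the stack snapshot are
read at slightly different moments, so the count is approximate.

```java
import java.util.Map;

class StuckThreadCounter {

    /** Counts live threads in the given state whose stack has a frame from the named class. */
    static int count(Thread.State state, String className) {
        int matches = 0;
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            if (e.getKey().getState() != state) {
                continue; // only threads currently in the requested state
            }
            for (StackTraceElement frame : e.getValue()) {
                if (frame.getClassName().equals(className)) {
                    matches++;
                    break; // count each thread at most once
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String cls = "org.apache.nutch.searcher.FetchedSegments$SegmentUpdater";
        int n = count(Thread.State.TIMED_WAITING, cls);
        System.out.println(n + " threads in TIMED_WAITING with a frame from " + cls);
    }
}
```

Logging this count alongside the request count would show whether the
number of sleeping SegmentUpdater threads tracks the number of queries.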

Please let me know if there are any other diagnostics I can run, or
information I can provide.

Many thanks,

-Mark