Posted to user@nutch.apache.org by "Rüdiger Schulz (SkyGate)" <sc...@skygate.de> on 2007/03/01 18:37:28 UTC

Memory leak during crawl?

Hello,

I'm currently using Nutch to crawl a small selection of websites. Altogether
there are about 60,000 HTML pages and about 10,000 PDFs, which should all go
into the index. I'm using a custom crawl class which, instead of running a
fixed number of iterations (depth), keeps iterating until there are no more
unfetched pages (quite similar to a Python script posted here some weeks ago).
This is fine for my case, as most of these pages don't change that often, so I
can afford such a long initial crawl, and because of URL filtering etc. I know
it will end up being a definite number of pages.
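
For reference, the loop in my crawl class looks roughly like this. The class
name is made up for this mail, indexing is left out, and the tool constructors
and method signatures are written down from memory of the 0.8/0.9 sources, so
they may not match your version exactly:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.NutchJob;

public class EndlessCrawl {                      // name made up for this mail
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    JobConf job = new NutchJob(conf);

    Path dir = new Path(args[0]);                // crawl directory
    Path crawlDb = new Path(dir, "crawldb");
    Path segments = new Path(dir, "segments");
    int threads = 10;
    long topN = Long.MAX_VALUE;

    // all the tools get the same conf passed to their constructors
    Generator generator = new Generator(conf);
    Fetcher fetcher = new Fetcher(conf);
    ParseSegment parseSegment = new ParseSegment(conf);
    CrawlDb crawlDbTool = new CrawlDb(conf);

    while (true) {
      Path segment = generator.generate(crawlDb, segments, -1, topN,
          System.currentTimeMillis());
      if (segment == null) {
        break;                                   // no more unfetched pages
      }
      fetcher.fetch(segment, threads);           // fetch the new segment
      if (!Fetcher.isParsing(job)) {
        parseSegment.parse(segment);             // parse, if the fetcher didn't
      }
      crawlDbTool.update(crawlDb, new Path[] { segment }, true, true);
    }
  }
}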

When I run this over the weekend :) I always get an OutOfMemoryError ("PermGen
space") after about 400 iterations, by which point about 40,000 pages have been
indexed. Increasing the PermGen space for the JVM only delays the error.
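
In case it matters, "increasing PermGen space" just means raising the Sun JVM
flag when launching the class above, along these lines (the heap/PermGen sizes,
crawl directory and <nutch classpath> are only placeholders):

  java -Xmx1024m -XX:MaxPermSize=256m -cp <nutch classpath> EndlessCrawl /data/crawl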

So I ran my crawl class with only a couple of URLs, so that it finishes after
about 30 minutes, and did some profiling with JDK 6's jconsole, jmap and jhat.
In jconsole I can see that PermGen usage grows by about 1 MB per iteration as
more and more classes get loaded. Comparing a jmap dump taken after about 5
iterations with another taken after about 30 iterations (the commands are
sketched below, after the list), I see the following:

* PermGen usage has almost doubled.
* None of my custom plugin classes has more than one instance.
* Looking at the Nutch classes, org.apache.nutch.plugin.Extension and
org.apache.nutch.plugin.PluginDescriptor have both gone from about 500 to
about 3000 instances, and org.apache.nutch.plugin.ExtensionPoint from 1500 to
1000 instances.
* Going further, org.apache.nutch.plugin.PluginRepository and
org.apache.hadoop.mapred.JobConf have both increased from 16 to 99 instances.
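
For completeness, the dumps were taken and inspected with the standard JDK 6
tools, roughly like this (the PID and file names are placeholders):

  jmap -dump:format=b,file=heap-05.bin <pid>    # after ~5 iterations
  jmap -dump:format=b,file=heap-30.bin <pid>    # after ~30 iterations
  jhat -baseline heap-05.bin heap-30.bin        # browse/compare at http://localhost:7000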

Now I'm wondering whether this behaviour is intended. It looks like 3 or 4 new
JobConf instances are created during each loop iteration (in Generator.generate,
Fetcher.fetch, ParseSegment.parse and CrawlDb.update). Is that really
necessary? I already pass my JobConf to the constructors.

What can I do to get the crawl to run to completion? Increase PermGen even
further, or is there anything else I should try?

Thanks for reading,

Rüdiger




-- 
View this message in context: http://www.nabble.com/Memory-leakduring-crawlr--tf3328411.html#a9254396
Sent from the Nutch - User mailing list archive at Nabble.com.