Posted to user@nutch.apache.org by Tomi NA <he...@gmail.com> on 2006/07/31 17:13:02 UTC

max file size vs. available RAM size: crawl uses up all available memory

I am trying to crawl/index a shared folder in the office LAN: that
means a lot of .zip files, a lot of big .pdfs (>5 MB) etc.
I sacrificed performance for memory efficiency where I found the
tradeoff ("indexer.mergeFactor" = 5, "indexer.minMergeDocs" = 5; see
the snippet below), but the crawl process breaks if I set
"file.content.limit" to, say, 10 MB, even though I'm testing on a
1 GB RAM machine. To be fair, some 300-400 MB are already taken by
miscellaneous programs, but still...
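
For reference, here is roughly how those overrides would look in
conf/nutch-site.xml (property names as in nutch-default.xml; the
values are just the ones mentioned above, with file.content.limit
expressed in bytes, so 10 MB ~= 10485760):

<configuration>
  <!-- index-time merge settings, lowered to trade speed for memory -->
  <property>
    <name>indexer.mergeFactor</name>
    <value>5</value>
  </property>
  <property>
    <name>indexer.minMergeDocs</name>
    <value>5</value>
  </property>
  <!-- maximum number of bytes fetched per file; ~10 MB here -->
  <property>
    <name>file.content.limit</name>
    <value>10485760</value>
  </property>
</configuration>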

I invoke nutch like so:
./bin/nutch crawl -local urldir -dir crawldir -depth 20 -topN 1000
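
I haven't touched the JVM heap size yet. If it matters, I assume it
could be raised through the NUTCH_HEAPSIZE variable that the stock
bin/nutch launcher reads (value in MB), something like:

# assumes the standard bin/nutch script, which turns NUTCH_HEAPSIZE
# into the JVM's -Xmx setting; 768 is just an example value
NUTCH_HEAPSIZE=768 ./bin/nutch crawl -local urldir -dir crawldir -depth 20 -topN 1000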

What I'd like to know is:
1) where does all the memory go?
2) how can I reduce the peak memory requirements?

To reiterate: I'm just testing at the moment, but eventually I need
to index documents at any tree depth, each up to, say, 10-20 MB in
size, and I hope I don't need 5+ GB of RAM to do it.

TIA,
t.n.a.