Posted to user@nutch.apache.org by P....@Deutschepost.de on 2007/10/09 17:24:54 UTC

HowTo crawl many files (ZIP with DOC,PDF....) correctly?

Hello,

I'm using Nutch to crawl our intranet.
I've set the file size limit quite high (2 MB; the default is only about
64 KB), so I also set the number of fetcher threads very low (between 1
and 4).
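
For reference, this is roughly what I put into conf/nutch-site.xml,
inside the <configuration> element (I raised both the http and the file
protocol limits; the property names are the ones I found in
nutch-default.xml, so please correct me if one of them is wrong):

  <property>
    <name>http.content.limit</name>
    <value>2097152</value>
    <description>Raised from the 65536 default to 2 MB.</description>
  </property>
  <property>
    <name>file.content.limit</name>
    <value>2097152</value>
    <description>Same 2 MB limit for file:// URLs.</description>
  </property>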

But while the fetcher runs, memory usage gets too high for my notebook
(1 GB of RAM, and java.exe needs ~700 MB; from that point on everything
gets very slow, of course).

My question is whether there is a way to fetch/crawl a large number of
files (ZIP files containing PDF, XLS, DOC and PPT) with less memory
usage. Or did I just configure my Nutch wrong?

I'm running Nutch as an intranet crawl, e.g. "bin/nutch crawl myurls
-dir crawldb -depth 1 -threads 2 -topN 50".
Since I have all the URLs in my text file, I chose depth 1.

Can I configure Nutch not to fetch too many files at once, and to start
fetching again after indexing the first batch it got?
I would be really happy if someone could give me a hint.
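
(I guess I could skip the one-shot crawl command and loop over the
lower-level steps myself, roughly like the sketch below, with -topN
capping each batch. I just pieced this together from the tutorial, so
I'm not sure these are the right commands:)

  # inject the URL list once, then fetch in batches of 50
  bin/nutch inject crawldb myurls
  for i in 1 2 3 4 5; do
    # pick the 50 best-scoring unfetched URLs into a new segment
    bin/nutch generate crawldb segments -topN 50
    segment=`ls -d segments/* | tail -1`
    bin/nutch fetch $segment -threads 2
    # mark them as fetched so the next generate skips them
    bin/nutch updatedb crawldb $segment
  done
  bin/nutch invertlinks linkdb segments/*
  bin/nutch index indexes crawldb linkdb segments/*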

Cheers
Lam Nguyen


Re: HowTo crawl many files (ZIP with DOC,PDF....) correctly?

Posted by Dennis Kubes <ku...@apache.org>.
Try setting your child JVM opts to -Xmx512M or higher.  This config 
variable is found in the hadoop-default.xml file, but copy it into your 
hadoop-site.xml file and change it there.
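
Something like this inside the <configuration> element should do it
(the property is mapred.child.java.opts if I remember correctly; check
hadoop-default.xml for the exact name and its default value):

  <property>
    <name>mapred.child.java.opts</name>
    <!-- give each child task JVM a 512 MB heap -->
    <value>-Xmx512m</value>
  </property>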

Dennis Kubes
