Posted to user@nutch.apache.org by "Chaushu, Shani" <sh...@intel.com> on 2015/07/18 16:22:00 UTC

Nutch doesn't crawl all seeds

Hi,
I'm using Nutch 1.9.
I have ~80k links that I want to crawl.
When I crawl them all together, only ~30k get crawled.
When I split the seed list into parts and crawled them separately, a lot more got crawled.
db.max.outlinks.per.page is set to -1.
Is there any parameter that might restrict the number of pages Nutch will crawl?
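For reference, here is a quick way to double-check the effective value of that property (the conf/ paths below assume a standard Nutch 1.x layout; I believe the shipped default in nutch-default.xml is 100):

    # run from the Nutch home directory (paths are just the usual defaults)
    grep -A 2 "db.max.outlinks.per.page" conf/nutch-site.xml conf/nutch-default.xml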

Thanks,
Shani


Re: Nutch doesn't crawl all seeds

Posted by Imtiaz Shakil Siddique <sh...@gmail.com>.
As far as I know,

If you are crawling with the crawl script found at $nutch_home/bin/crawl, each
iteration generates a fetchlist of at most the top 50,000 URLs (the script passes
that limit to the generate step via -topN), so many of the URLs in your seed file
won't be fetched and parsed in the first iteration.
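
The relevant lines in the 1.x crawl script look roughly like the excerpt below
(an approximation from memory, so check your own copy; the crawldb/segments paths
are whatever you passed to the script):

    # approximate excerpt from $nutch_home/bin/crawl (Nutch 1.x)
    numSlaves=1
    # number of URLs put into one fetchlist per iteration
    sizeFetchlist=`expr $numSlaves \* 50000`

    # further down, that limit is handed to the generator as -topN
    $bin/nutch generate $commonOptions "$CRAWL_PATH"/crawldb "$CRAWL_PATH"/segments \
        -topN $sizeFetchlist -numFetchers $numSlaves -noFilter

So with ~80k seeds you can either raise sizeFetchlist (or numSlaves), or simply run
more iterations so the remaining URLs are generated in later rounds.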

You should also check regex-urlfilter.txt in the conf directory to make sure that
none of your seed URLs are being rejected before they are injected; a couple of
quick checks are sketched below.
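
Something along these lines (the seed and crawldb paths are just examples, substitute your own):

    # how many seeds you have vs. how many URLs actually made it into the crawldb
    wc -l urls/seed.txt
    bin/nutch readdb crawl/crawldb -stats      # look at the "TOTAL urls" line

    # the default filter file rejects URLs containing ? * ! @ = (probable queries);
    # see whether that rule is active in your conf
    grep -n '^-\[' conf/regex-urlfilter.txt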

Hope it helps.