Posted to user@nutch.apache.org by Hermann Rokicz <he...@googlemail.com> on 2007/02/11 22:02:43 UTC

Limitations of intranet crawling

Hi,

I'm planning to use Nutch to crawl between 1 and 2 million domains.
From the documentation I guess intranet crawling would be the right
method.

Are there any known problems with intranet crawling and a domain list
of this size?

Regards,
Hermann!
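
For context, "intranet crawling" in the Nutch docs of this era means
the one-shot crawl command, as opposed to the step-by-step whole-web
tools. A minimal sketch, assuming Nutch 0.8-era syntax and placeholder
arguments:

    # urls/ holds a flat text file with one seed URL per line;
    # conf/crawl-urlfilter.txt controls which links are followed
    bin/nutch crawl urls -dir crawl.test -depth 3 -topN 50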

Re: Limitations of intranet crawling

Posted by Hermann Rokicz <he...@googlemail.com>.
On 2/11/07, Dennis Kubes <nu...@dragonflymc.com> wrote:
> There shouldn't be a problem with lists of that size. We have done
> initial injections with > 5MM pages per fetchlist.

Do I have to add a regular expression for every single domain in
conf/crawl-urlfilter.txt, or is there an easier and probably faster
way?
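
For reference, per-domain entries in conf/crawl-urlfilter.txt use the
stock regex filter syntax: one +/- prefixed regular expression per
line, applied in order. A minimal sketch (the domain names are
placeholders):

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]
    # accept pages from the listed domains, including subdomains
    +^http://([a-z0-9]*\.)*example-one.com/
    +^http://([a-z0-9]*\.)*example-two.de/
    # reject everything else
    -.

With 1-2 million domains, evaluating one regex per domain can get
slow; the urlfilter-prefix plugin (conf/prefix-urlfilter.txt, plain
URL prefixes instead of regexes) is generally a faster fit for large
flat lists.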

Re: Limitations of intranet crawling

Posted by Dennis Kubes <nu...@dragonflymc.com>.
There shouldn't be a problem with lists of that size. We have done
initial injections with > 5MM pages per fetchlist.
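
For reference, a minimal sketch of that kind of injection using the
step-by-step (whole-web) commands, assuming Nutch 0.8-era syntax and
placeholder paths:

    # urls/ holds flat text files with one seed URL per line
    bin/nutch inject crawl/crawldb urls
    # generate a fetchlist of up to 5MM top-scoring pages
    bin/nutch generate crawl/crawldb crawl/segments -topN 5000000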

Dennis Kubes

Hermann Rokicz wrote:
> Hi,
> 
> I'm planning to use Nutch to crawl between 1 and 2 million domains.
> From the documentation I guess intranet crawling would be the right
> method.
> 
> Are there any known problems with intranet crawling and a domain list
> of this size?
> 
> Regards,
> Hermann!