Posted to user@nutch.apache.org by Hermann Rokicz <he...@googlemail.com> on 2007/02/11 22:02:43 UTC
Limitations of intranet crawling
Hi,
I'm planning to use Nutch to crawl between 1 and 2 million domains.
From the documentation I guess intranet crawling would be the right
method.
Are there known problems with intranet crawling and a domain list this size?
Regards,
Hermann!
Re: Limitations of intranet crawling
Posted by Hermann Rokicz <he...@googlemail.com>.
On 2/11/07, Dennis Kubes <nu...@dragonflymc.com> wrote:
> There shouldn't be a problem with lists that size. We have done initial
> injections with > 5MM pages per fetchlist.
Do I have to add a regular expression for every single domain in
conf/crawl-urlfilter.txt, or is there an easier and probably faster
way?
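One way to avoid hand-editing crawl-urlfilter.txt is to generate the rules from the domain list. The sketch below is illustrative, not an official Nutch tool: it assumes a plain-text file with one domain per line, and it emits include rules in the same shape as the sample rules shipped in crawl-urlfilter.txt (an accept rule per domain, subdomains included, followed by a catch-all exclude). With millions of regex rules the per-URL filtering cost can become significant, so a prefix-based filter plugin may be worth evaluating for lists this large.

```python
import re

def urlfilter_lines(domains):
    """Generate crawl-urlfilter.txt rules, one accept rule per domain.

    Each rule matches the domain and any of its subdomains; dots in
    the domain are escaped so they match literally. A final "-."
    rule excludes everything else.
    """
    lines = []
    for domain in domains:
        escaped = re.escape(domain.strip().lower())
        lines.append(r"+^http://([a-z0-9-]+\.)*" + escaped + "/")
    lines.append("-.")  # exclude everything not matched above
    return lines

# Example: three domains produce three include rules plus the exclude.
rules = urlfilter_lines(["apache.org", "example.com", "example.net"])
print("\n".join(rules))
```

Writing the output of such a script to conf/crawl-urlfilter.txt would replace the manual per-domain editing the question describes; the exact rule syntax should be checked against the sample file in your Nutch release.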
Re: Limitations of intranet crawling
Posted by Dennis Kubes <nu...@dragonflymc.com>.
There shouldn't be a problem with lists that size. We have done initial
injections with > 5MM pages per fetchlist.
Dennis Kubes
Hermann Rokicz wrote:
> Hi,
>
> I'm planning to use Nutch to crawl between 1 and 2 million domains.
> From the documentation I guess intranet crawling would be the right
> method.
>
> Are there known problems with intranet crawling and a domain list
> this size?
>
> Regards,
> Hermann!