Posted to user@nutch.apache.org by Jean Vence <jv...@gmail.com> on 2016/05/25 09:44:58 UTC

Nutch crawling other countries domain despite db.ignore.external.links

I am trying to crawl a single site and have set
db.ignore.external.links=true. It does not seem to work, though: Nutch
still crawls sites with a different country extension. For example, if
the seed is mysite.com, it crawls mysite.com, mysite.es and
mysite.it.

I don't want to use a regex to exclude them because I have multiple
URLs and don't want to maintain a long list.

Is this a known bug?

Thanks,

Jean Vence

Re: Nutch crawling other countries domain despite db.ignore.external.links

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Jean,

  db.ignore.external.links=true
should work. Which version of Nutch are you using?
How is the property set? Does your seed list only
contain URLs from mysite.com, and none from mysite.es?

Regards,
Sebastian
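
For readers following the thread, the reason the seed list matters can be sketched in a few lines. This is only an illustration of a host-based external-link check, not Nutch's actual code: the function names is_external and filter_outlinks are hypothetical, and the sketch assumes that a link counts as external whenever its host differs from the host of the page it was found on.

```python
from urllib.parse import urlparse


def is_external(from_url: str, to_url: str) -> bool:
    """Return True when the link target lives on a different host than the source page.

    Hypothetical helper: hosts are compared case-insensitively and nothing else
    (path, scheme, port) is taken into account.
    """
    from_host = (urlparse(from_url).hostname or "").lower()
    to_host = (urlparse(to_url).hostname or "").lower()
    return from_host != to_host


def filter_outlinks(from_url: str, outlinks: list[str]) -> list[str]:
    """Keep only the outlinks that stay on the host of the source page."""
    return [url for url in outlinks if not is_external(from_url, url)]


if __name__ == "__main__":
    page = "http://mysite.com/index.html"
    links = [
        "http://mysite.com/about",
        "http://mysite.es/",
        "http://mysite.it/contatti",
    ]
    # Only the link that stays on mysite.com survives the filter.
    print(filter_outlinks(page, links))
```

Under this assumption, a link from mysite.com to mysite.es is dropped because the hosts differ. A check of this kind only looks at links discovered on fetched pages, so a mysite.es URL placed directly in the seed list would not pass through it at all, which is why the seed list is worth checking.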
