Posted to user@nutch.apache.org by Drew Hite <hi...@gmail.com> on 2008/06/16 19:09:34 UTC

db.ignore.external.links=true and redirects

Hello,
I would like to restrict a crawl to the domain specified in a seed url
without using the urlfilter-regex plugin.  The db.ignore.external.links
property looked like it would do the trick, but I've found that links that
are redirected outside the seed url's host get through.  For example, if I
start at http://www.xyz.com and Nutch finds a link pointing to
http://www.xyz.com/blog which is actually a redirection to
http://blog.xyz.com then Nutch will start fetching pages from
http://blog.xyz.com even though it was not in the seed url file.  Is this
the intended behavior for the db.ignore.external.links property?  If so, is
there a way to restrict a crawl to a particular site without the regex
filter?  If not, would it be useful to create a patch to check the toUrl
hosts against the hosts specified in the original seed list?
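
To make the proposed check concrete, here is a rough sketch of the idea:
collect the hosts of the seed urls once, then reject any toUrl (including
a redirect target) whose host is not in that set.  The class and method
names (SeedHostCheck, isAllowed) are hypothetical and are not part of
Nutch; where such a check would hook into the fetcher or parser is left
open.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, NOT an existing Nutch class: it only illustrates
// comparing a toUrl host against the hosts of the original seed list.
public class SeedHostCheck {
    private final Set<String> seedHosts = new HashSet<>();

    public SeedHostCheck(List<String> seedUrls) {
        for (String seed : seedUrls) {
            String host = hostOf(seed);
            if (host == null) {
                throw new IllegalArgumentException("Seed url has no host: " + seed);
            }
            seedHosts.add(host);
        }
    }

    // Returns true only if toUrl parses and its host exactly matches a seed host.
    public boolean isAllowed(String toUrl) {
        String host = hostOf(toUrl);
        return host != null && seedHosts.contains(host);
    }

    // Lower-cased host of the url, or null if it cannot be parsed or has no host.
    private static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

With a seed list containing only http://www.xyz.com, isAllowed accepts
http://www.xyz.com/blog but rejects http://blog.xyz.com, so the redirect
target in the example above would be dropped.  The match is on the exact
host, so a subdomain counts as a different site.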

Thanks,
Drew

Re: db.ignore.external.links=true and redirects

Posted by Drew Hite <hi...@gmail.com>.
I should have mentioned that I'm working with the trunk.
