Posted to user@nutch.apache.org by Robert Scavilla <rs...@gmail.com> on 2018/08/08 17:26:03 UTC

rejected by filters

Hello, and thank you for helping. For some reason Nutch is rejecting the domain
https://www.penn.museum/

The regex-urlfilter is: +.
seeding with https://www.penn.museum/

And on crawl it keeps giving:
Injector: Total urls rejected by filters: 1

This is the only time I've had this issue, and I was wondering whether the
.museum TLD is the problem.
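
For readers following along, here is a rough sketch of how first-match-wins
rules like the ones in regex-urlfilter.txt behave. It is a simplified model
and not Nutch's code; the function name first_match_filter and the rule list
format are made up for illustration. It shows that a lone "+." rule accepts
the seed URL, which suggests the rejection is coming from somewhere other
than that rule.

```python
import re

def first_match_filter(rules, url):
    """Simplified model of first-match-wins URL rules (hypothetical helper).

    rules is a list of (sign, pattern) pairs, where sign is "+" or "-".
    The first pattern found anywhere in the URL decides the outcome.
    A URL that matches no rule is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False

# The single rule from the post: "+." accepts any non-empty URL.
rules = [("+", ".")]
seed = "https://www.penn.museum/"
print(seed, "accepted" if first_match_filter(rules, seed) else "rejected")
```

If the URL passes this rule and is still counted under "rejected by
filters", another active URL filter or normalizer is the likely cause.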

Re: rejected by filters

Posted by BlackIce <bl...@gmail.com>.
I think you are correct in your assumption.
According to this:

https://issues.apache.org/jira/browse/NUTCH-2620?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel

Nutch assumes that the TLD is no longer than 4 characters. This is in the
process of being fixed in the next release, which should be out shortly.
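
To illustrate the kind of check that would cause this, here is a small sketch.
The two patterns below are hypothetical and are not taken from the Nutch
source; they only show how capping the TLD at 4 letters lets .org through
while rejecting .museum, and how lifting the cap fixes it.

```python
import re

# Hypothetical host patterns, for illustration only (not Nutch's code).
# CAPPED allows a TLD of 2 to 4 letters; UNCAPPED allows 2 or more.
CAPPED = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,4}$")
UNCAPPED = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}$")

def host_ok(pattern, host):
    """Return True if the lowercased host name matches the given pattern."""
    return pattern.match(host.lower()) is not None

for host in ("www.apache.org", "www.penn.museum"):
    print(host, "capped:", host_ok(CAPPED, host),
          "uncapped:", host_ok(UNCAPPED, host))
```

With the capped pattern, www.penn.museum fails because "museum" is 6 letters
long, even though the URL is otherwise fine.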

Greetings
