Posted to dev@nutch.apache.org by "Sebastian Nagel (Jira)" <ji...@apache.org> on 2023/02/24 16:26:00 UTC
[jira] [Assigned] (NUTCH-2973) Single domain names (eg https://localnet) can't be crawled - filtering fails
[ https://issues.apache.org/jira/browse/NUTCH-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel reassigned NUTCH-2973:
--------------------------------------
Assignee: Sebastian Nagel
> Single domain names (eg https://localnet) can't be crawled - filtering fails
> ----------------------------------------------------------------------------
>
> Key: NUTCH-2973
> URL: https://issues.apache.org/jira/browse/NUTCH-2973
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.19
> Environment: Nutch 1.19, checked on Windows 10 and Ubuntu. Both have the same issue.
> I'm trying to crawl a SharePoint intranet using Nutch where the URLs are similar to:
>
> {{https://localnet/something.aspx}}
> The issue is that Nutch is rejecting any URL with a single-element domain name such as localnet above. "localnet.com" is not rejected, nor is "local.localnet". It almost feels as if there's a chunk of code within Nutch, unrelated to the filtering mechanisms, that rejects URLs outright if they don't have a WWW-style format and a WWW-style domain such as .COM.
> Error message:
>
> {{Total urls rejected by filters: 1}}
> I've checked and updated all the _filter_ files in the conf directory. Even making them incredibly permissive (effectively "crawl everything") has not helped.
> Reporter: David Smith
> Assignee: Sebastian Nagel
> Priority: Blocker
>
> There appears to be a bug within the core of Nutch that prevents any URL with a single-element domain name from being crawled. Example:
> {{https://{*}localnet{*}/something.aspx}}
> The issue is that Nutch is rejecting any URL with a single-element domain name such as *localnet* above. "localnet.com" is not rejected, nor is "local.localnet". It almost feels as if there's a chunk of code within Nutch, unrelated to the filtering mechanisms, that rejects URLs outright if they don't have a WWW-style format and a WWW-style domain such as .COM.
> Error message:
> {{Total urls rejected by filters: 1}}
> I've checked and updated all the filter files in the conf directory. Even making them incredibly permissive (effectively "crawl everything") has not helped. As soon as a dot (.) is added to the domain name, the URL is no longer rejected, e.g. blah.localnet.
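
The behaviour described above can be reproduced outside Nutch with a minimal sketch. The class SingleLabelHostDemo and the method hasDottedHost are hypothetical names invented for illustration and are not taken from the Nutch code base. The sketch only assumes that some check in the filter chain requires a dot in the host name, which would match the reported pattern but is not confirmed by this report.

```java
import java.net.URI;
import java.util.List;

public class SingleLabelHostDemo {

    // Hypothetical rule (an assumption, not Nutch code): accept a URL only
    // if its host contains a dot that is neither the first nor the last
    // character of the host name.
    static boolean hasDottedHost(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return false; // no parsable host component
        }
        int dot = host.indexOf('.');
        return dot > 0 && !host.endsWith(".");
    }

    public static void main(String[] args) {
        // Host names taken from the report above.
        List<String> urls = List.of(
            "https://localnet/something.aspx",
            "https://localnet.com/something.aspx",
            "https://local.localnet/something.aspx",
            "https://blah.localnet/something.aspx");
        for (String url : urls) {
            String verdict = hasDottedHost(url) ? "accepted" : "rejected";
            System.out.println(url + " -> " + verdict);
        }
    }
}
```

Note that java.net.URI parses https://localnet/something.aspx without complaint and reports the host as localnet, so the URL itself is well-formed. Under the assumed rule only the additional dot requirement causes the rejection, which is why adding any second label (blah.localnet) would be enough to pass.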
--
This message was sent by Atlassian Jira
(v8.20.10#820010)