Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2015/07/21 17:10:05 UTC

[jira] [Updated] (NUTCH-2065) Domain URL filter to support protocols

     [ https://issues.apache.org/jira/browse/NUTCH-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2065:
---------------------------------
    Attachment: NUTCH-2065.patch

Patch for 1.10; it should also work in trunk. See the supplied unit test. The other tests pass.

> Domain URL filter to support protocols
> --------------------------------------
>
>                 Key: NUTCH-2065
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2065
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.10
>            Reporter: Markus Jelsma
>             Fix For: 1.11
>
>         Attachments: NUTCH-2065.patch
>
>
> The filter allows all protocols for all whitelisted domains, hosts or suffixes, but it usually makes little sense to index both the http and https URLs of the same domain. This is not unlike the host URL filter, which prevents indexing of duplicate hosts, e.g. apache.org and www.apache.org.
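
To illustrate the idea in the description, here is a minimal, self-contained sketch of a whitelist filter that can restrict a domain to specific protocols. It is not the code from NUTCH-2065.patch: the class name ProtocolAwareDomainFilter, the allow() method and the in-memory rule map are all hypothetical, and how the real plugin reads its rules from configuration is not shown in this message. The only thing borrowed from Nutch is the convention that filter() returns the URL when it is accepted and null when it is rejected.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch only; names and rule storage are assumptions,
// not the implementation attached to NUTCH-2065.
public class ProtocolAwareDomainFilter {

    // Whitelisted host/domain/suffix -> allowed protocols.
    // An empty set means "any protocol", which matches the old behaviour.
    private final Map<String, Set<String>> rules = new HashMap<>();

    // Whitelist a name, optionally restricted to the given protocols.
    public void allow(String name, String... protocols) {
        Set<String> allowed =
            rules.computeIfAbsent(name.toLowerCase(Locale.ROOT), k -> new HashSet<>());
        for (String protocol : protocols) {
            allowed.add(protocol.toLowerCase(Locale.ROOT));
        }
    }

    // Returns the URL if it is accepted, or null if it is rejected.
    public String filter(String url) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return null; // unparsable URLs are rejected
        }
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) {
            return null;
        }
        scheme = scheme.toLowerCase(Locale.ROOT);

        // Try the full host first, then each parent name:
        // "www.apache.org", then "apache.org", then "org".
        String candidate = host.toLowerCase(Locale.ROOT);
        while (candidate != null) {
            Set<String> allowed = rules.get(candidate);
            if (allowed != null) {
                // The most specific matching rule decides.
                return (allowed.isEmpty() || allowed.contains(scheme)) ? url : null;
            }
            int dot = candidate.indexOf('.');
            candidate = dot < 0 ? null : candidate.substring(dot + 1);
        }
        return null; // not whitelisted at all
    }
}
```

With a rule such as allow("apache.org", "http"), this sketch accepts http://www.apache.org/ and rejects https://www.apache.org/, while a rule without protocols, such as allow("example.com"), keeps accepting every protocol for that domain.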


