Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2015/12/22 22:29:46 UTC

[jira] [Commented] (NUTCH-2065) Domain URL filter to support protocols

    [ https://issues.apache.org/jira/browse/NUTCH-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068757#comment-15068757 ] 

Sebastian Nagel commented on NUTCH-2065:
----------------------------------------

* In general: wouldn't a URL normalizer be preferable? If URLs of one protocol are suppressed, links may get lost: some documents of a site that mostly uses https may be referenced only from a few http pages.
* Previously, the domain URL filter was agnostic regarding the protocol: shouldn't this behaviour be kept in all cases, i.e., also for ftp? Almost everything is now http or https, but maybe we should keep the interpretation of "no protocol specified" -> "any protocol allowed".
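
To illustrate the second point, here is a minimal sketch of the "no protocol specified" -> "any protocol allowed" interpretation. It is not the attached patch and not the existing domain URL filter code: the class name ProtocolAwareDomainRules, its methods, and the rule syntax (a bare name such as "apache.org" versus a prefixed one such as "https://example.com") are all assumptions made for illustration only.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; not the NUTCH-2065 patch and not Nutch's domain URL filter. */
class ProtocolAwareDomainRules {
  // Names listed without a protocol: any protocol is allowed (including ftp).
  private final Set<String> anyProtocol = new HashSet<>();
  // Names listed with a protocol: only the listed protocols are allowed.
  private final Map<String, Set<String>> restricted = new HashMap<>();

  /** Assumed rule syntax: "apache.org" (any protocol) or "https://example.com". */
  void addRule(String line) {
    String rule = line.trim().toLowerCase(Locale.ROOT);
    int sep = rule.indexOf("://");
    if (sep < 0) {
      anyProtocol.add(rule);
    } else {
      String protocol = rule.substring(0, sep);
      String name = rule.substring(sep + 3);
      restricted.computeIfAbsent(name, k -> new HashSet<>()).add(protocol);
    }
  }

  /** Returns true if the URL's host, or one of its parent domains, is allowed. */
  boolean accepts(String url) {
    URI uri;
    try {
      uri = new URI(url);
    } catch (URISyntaxException e) {
      return false;
    }
    if (uri.getScheme() == null || uri.getHost() == null) {
      return false;
    }
    String protocol = uri.getScheme().toLowerCase(Locale.ROOT);
    String name = uri.getHost().toLowerCase(Locale.ROOT);
    while (true) {
      if (anyProtocol.contains(name)) {
        return true;
      }
      Set<String> allowed = restricted.get(name);
      if (allowed != null && allowed.contains(protocol)) {
        return true;
      }
      // Strip the leftmost label: www.apache.org -> apache.org -> org
      int dot = name.indexOf('.');
      if (dot < 0) {
        return false;
      }
      name = name.substring(dot + 1);
    }
  }
}
```

With the rules "apache.org" and "https://example.com", this sketch accepts http://www.apache.org/ and ftp://apache.org/pub/, accepts https://example.com/a, and rejects http://example.com/a. Rules without a protocol behave exactly as before, so existing rule files would not change meaning.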

> Domain URL filter to support protocols
> --------------------------------------
>
>                 Key: NUTCH-2065
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2065
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.10
>            Reporter: Markus Jelsma
>         Attachments: NUTCH-2065.patch, NUTCH-2065.patch
>
>
> The filter allows all protocols for all whitelisted domains, hosts or suffixes, but it usually makes little sense to index both http and https URLs of the same domain. This is not unlike the host URL filter, which prevents indexing of duplicate hosts, e.g. apache.org and www.apache.org.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)