Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2010/06/23 15:05:51 UTC

[jira] Updated: (NUTCH-830) ScoringFilter to restrict the crawl to the hosts/domains listed in the seeds

     [ https://issues.apache.org/jira/browse/NUTCH-830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-830:
--------------------------------

    Attachment: NUTCH-830.patch

> ScoringFilter to restrict the crawl to the hosts/domains listed in the seeds
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-830
>                 URL: https://issues.apache.org/jira/browse/NUTCH-830
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.1
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: 2.0
>
>         Attachments: NUTCH-830.patch
>
>
> The DomainURLFilter lets you specify the domains to consider for a crawl. This works fine but requires editing a list of domains / hosts manually. The patch presented here offers the same functionality through a different mechanism: a custom scoring filter that filters the outlinks.
> 1. Add a metadata entry to each URL in your seed list, e.g. '_origin_', with the seed URL as its value
> e.g. http://www.cnn.com/    _origin_=http://www.cnn.com/
> 2. The custom scoring filter then takes care of:
>     * transmitting the origin metadata to the outlinks
>     * removing the outlinks which do not have the same host / domain as the origin
> The parameter _scoring.insite.mode_ specifies whether to restrict on the host or on the domain. The parameter _scoring.insite.addOriginOnInject_ enables adding the metadata automatically during the injection step, reusing the seed URL as its value.
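
The filtering step described above can be sketched in plain Java. This is only an illustration of the host / domain comparison, not the code in NUTCH-830.patch and not the Nutch ScoringFilter API: the class InSiteFilterSketch, the Mode enum and all method names are hypothetical, and the "domain" is approximated naively as the last two labels of the host name (a real implementation would need a public-suffix-aware lookup).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; none of these names come from the patch or from Nutch.
public class InSiteFilterSketch {

    // Assumed stand-in for the two behaviours selected by scoring.insite.mode.
    enum Mode { HOST, DOMAIN }

    // Returns the lower-cased host of a URL, or null if it cannot be parsed.
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // Naive assumption: the domain is the last two labels of the host.
    // This is wrong for suffixes such as "co.uk".
    static String naiveDomainOf(String host) {
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    // True if the outlink belongs to the same host or domain as the origin.
    static boolean sameSite(String origin, String outlink, Mode mode) {
        String a = hostOf(origin);
        String b = hostOf(outlink);
        if (a == null || b == null) {
            return false;
        }
        if (mode == Mode.HOST) {
            return a.equals(b);
        }
        return naiveDomainOf(a).equals(naiveDomainOf(b));
    }

    // Keeps only the outlinks on the origin's site. In the scheme described
    // above, each kept outlink would also inherit the same origin metadata.
    static List<String> keepInSite(String origin, List<String> outlinks, Mode mode) {
        List<String> kept = new ArrayList<>();
        for (String link : outlinks) {
            if (sameSite(origin, link, mode)) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = List.of(
            "http://www.cnn.com/world/",
            "http://edition.cnn.com/",
            "http://www.example.org/");
        System.out.println(keepInSite("http://www.cnn.com/", outlinks, Mode.HOST));
        System.out.println(keepInSite("http://www.cnn.com/", outlinks, Mode.DOMAIN));
    }
}
```

With the origin http://www.cnn.com/, host mode keeps only the www.cnn.com link, while domain mode also keeps edition.cnn.com; the example.org link is dropped in both cases.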
