Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2010/07/14 10:43:51 UTC
[jira] Updated: (NUTCH-830) ScoringFilter to restrict the crawl to the hosts/domains listed in the seeds
[ https://issues.apache.org/jira/browse/NUTCH-830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche updated NUTCH-830:
--------------------------------
Priority: Minor (was: Major)
> ScoringFilter to restrict the crawl to the hosts/domains listed in the seeds
> ----------------------------------------------------------------------------
>
> Key: NUTCH-830
> URL: https://issues.apache.org/jira/browse/NUTCH-830
> Project: Nutch
> Issue Type: New Feature
> Affects Versions: 1.1
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Priority: Minor
> Fix For: 2.0
>
> Attachments: NUTCH-830.patch
>
>
> The DomainURLFilter lets you specify the domains to consider for a crawl. This works fine but requires editing a list of domains/hosts manually. The patch presented here offers the same functionality through a different mechanism: a custom scoring filter that filters the outlinks.
> 1. Add a metadata entry, e.g. '_origin_', to each URL in your seed list, with the seed URL as its value
> e.g. http://www.cnn.com/ _origin_=http://www.cnn.com/
> 2. The custom scoring filter then takes care of:
> * transmitting the origin metadata to the page's outlinks
> * removing the outlinks that do not have the same host/domain as the origin
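
The two responsibilities listed above can be sketched in plain Java, independently of the Nutch API. This is only an illustration of the idea, not the code in NUTCH-830.patch: the class name `InSiteFilterSketch`, the `Mode` enum, and the "last two labels" notion of a domain are assumptions made for the example (the naive domain rule treats `bbc.co.uk` as `co.uk`, which a real implementation would need to handle properly). Dropping every outlink when the origin cannot be parsed is also an assumption.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class InSiteFilterSketch {

    /** Hypothetical modes mirroring the host/domain choice in the description. */
    enum Mode { HOST, DOMAIN }

    /** Lower-cased host of a URL, or null if the URL cannot be parsed. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    /** Naive "domain": the last two labels of the host (assumption, see lead-in). */
    static String domainOf(String host) {
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    /** Comparison key for a URL under the given mode, or null if unparsable. */
    static String keyOf(String url, Mode mode) {
        String host = hostOf(url);
        if (host == null) {
            return null;
        }
        return mode == Mode.HOST ? host : domainOf(host);
    }

    /**
     * Keeps only the outlinks on the same host/domain as the origin and
     * maps each kept outlink to the origin it inherits.
     */
    static Map<String, String> filterOutlinks(String origin, List<String> outlinks, Mode mode) {
        Map<String, String> kept = new LinkedHashMap<>();
        String originKey = keyOf(origin, mode);
        if (originKey == null) {
            return kept; // assumption: an unparsable origin keeps nothing
        }
        for (String link : outlinks) {
            if (originKey.equals(keyOf(link, mode))) {
                kept.put(link, origin); // the outlink carries the origin forward
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = List.of(
            "http://www.cnn.com/world",
            "http://edition.cnn.com/",
            "http://www.bbc.co.uk/");
        System.out.println(filterOutlinks("http://www.cnn.com/", outlinks, Mode.HOST));
        System.out.println(filterOutlinks("http://www.cnn.com/", outlinks, Mode.DOMAIN));
    }
}
```

In host mode only `http://www.cnn.com/world` survives; in domain mode `http://edition.cnn.com/` is kept as well, and the BBC link is dropped in both cases.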
> The parameter _scoring.insite.mode_ specifies whether to restrict on the host or the domain. The parameter _scoring.insite.addOriginOnInject_ adds the metadata during the injection step, reusing the seed URL automatically.
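
To make the injection-time behaviour concrete, here is a minimal sketch of how a seed line could end up with an '_origin_' entry. It does not use the Nutch injector; `SeedOriginSketch`, `Seed`, and `parseSeedLine` are hypothetical names, the boolean argument stands in for _scoring.insite.addOriginOnInject_, and splitting the line on whitespace is an assumption based on the seed example shown in the description. Whether an explicit '_origin_' takes precedence over the automatic one is also an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SeedOriginSketch {

    static final String ORIGIN_KEY = "_origin_";

    /** A seed URL together with its key=value metadata. */
    static final class Seed {
        final String url;
        final Map<String, String> metadata;

        Seed(String url, Map<String, String> metadata) {
            this.url = url;
            this.metadata = metadata;
        }
    }

    /**
     * Parses "url key=value ..." and, when addOriginOnInject is true,
     * fills in _origin_ with the seed URL if the line did not set it.
     */
    static Seed parseSeedLine(String line, boolean addOriginOnInject) {
        String[] parts = line.trim().split("\\s+");
        String url = parts[0];
        Map<String, String> metadata = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq > 0) {
                // split at the first '=' only, so values may contain '='
                metadata.put(parts[i].substring(0, eq), parts[i].substring(eq + 1));
            }
        }
        if (addOriginOnInject) {
            metadata.putIfAbsent(ORIGIN_KEY, url); // assumption: explicit value wins
        }
        return new Seed(url, metadata);
    }

    public static void main(String[] args) {
        Seed explicit = parseSeedLine("http://www.cnn.com/ _origin_=http://www.cnn.com/", false);
        Seed automatic = parseSeedLine("http://www.cnn.com/", true);
        System.out.println(explicit.metadata);
        System.out.println(automatic.metadata);
    }
}
```

Both calls in `main` yield the same metadata, which is the point of the option: the seed list no longer needs the '_origin_' entry written out by hand.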