Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2010/08/17 10:56:16 UTC

[jira] Resolved: (NUTCH-830) ScoringFilter to restrict the crawl to the hosts/domains listed in the seeds

     [ https://issues.apache.org/jira/browse/NUTCH-830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-830.
---------------------------------

    Resolution: Not A Problem

This approach has a major flaw: it loses the links between two authorized hosts or domains, which would harm their scores. It also lets through redirections from an authorized host but not the subsequent outlinks, meaning that a document with some content can be fetched, and hence indexed, even if its host name does not match the seed it was found from. This is neither good nor bad in itself, just a bit counterintuitive.

This was probably just an interesting example of how to use a ScoringFilter; as far as the functionality goes, using the DomainURLFilter should be a more satisfying approach.


> ScoringFilter to restrict the crawl to the hosts/domains listed in the seeds
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-830
>                 URL: https://issues.apache.org/jira/browse/NUTCH-830
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.1
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>            Priority: Minor
>             Fix For: 2.0
>
>         Attachments: NUTCH-830.patch
>
>
> The DomainURLFilter lets you specify the domains to consider for a crawl. This works fine but requires editing a list of domains / hosts manually. The patch presented here offers the same functionality through a different mechanism: a custom scoring filter is used to filter the outlinks.
> 1. Add a metadata entry to your seed list, e.g. '_origin_', with the seed URL as its value
> e.g. http://www.cnn.com/    _origin_=http://www.cnn.com/
> 2. The custom scoring filter takes care of:
>     * transmitting the origin metadata to the outlinks
>     * removing the outlinks which do not have the same host / domain as the origin
> The parameter _scoring.insite.mode_ specifies whether to restrict on the host or the domain. The parameter _scoring.insite.addOriginOnInject_ enables the addition of the metadata during the injection step, reusing the URL automatically.
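
The two steps described above can be sketched roughly as follows. This is a stand-alone illustration, not the code from NUTCH-830.patch, and it deliberately avoids Nutch's ScoringFilter API. The class name InSiteOutlinkFilter, its methods, and the Mode enum are hypothetical. The domain extraction is a naive "last two labels" rule assumed only for this example; a real implementation would need a public-suffix-aware lookup.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of the in-site restriction; not the actual patch. */
class InSiteOutlinkFilter {

    /** Mirrors the idea behind scoring.insite.mode: compare hosts or domains. */
    enum Mode { HOST, DOMAIN }

    /** Returns the lower-cased host of a URL, or null if it cannot be parsed. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    /** Naive assumption: the domain is the last two labels of the host. */
    static String domainOf(String host) {
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    /** Key used for comparison: the host or the (naive) domain of the URL. */
    static String keyOf(String url, Mode mode) {
        String host = hostOf(url);
        if (host == null) {
            return null;
        }
        return mode == Mode.HOST ? host : domainOf(host);
    }

    /**
     * Drops the outlinks whose host/domain differs from the origin's and
     * passes the origin value on to every outlink that is kept.
     * The result maps each kept outlink to its inherited origin.
     */
    static Map<String, String> filterOutlinks(String origin, List<String> outlinks, Mode mode) {
        Map<String, String> kept = new LinkedHashMap<>();
        String originKey = keyOf(origin, mode);
        if (originKey == null) {
            return kept; // unparsable origin: keep nothing
        }
        for (String link : outlinks) {
            if (originKey.equals(keyOf(link, mode))) {
                kept.put(link, origin); // transmit the origin metadata
            }
        }
        return kept;
    }
}
```

With the origin http://www.cnn.com/, host mode would keep only outlinks on www.cnn.com, while domain mode would also keep outlinks on other hosts under cnn.com. Note that a link from one authorized seed's site to another's is dropped in either mode, which is the flaw described in the resolution comment.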
