Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2019/01/06 20:06:00 UTC

[jira] [Updated] (NUTCH-2627) Fetcher to optionally filter URLs

     [ https://issues.apache.org/jira/browse/NUTCH-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2627:
-----------------------------------
    Description: When running a large web crawl it happens that a webadmin requests to immediately stop crawling a certain domain. The default Nutch workflow applies URL filters only to seeds and outlinks. Applying filters during fetch list generation is expensive with a large CrawlDb (fetch lists are usually much shorter). Allowing the fetcher to optionally filter URLs would make it possible to apply changed filter rules to the next fetcher job launched, even if the segment has already been created (especially if multiple segments are generated in one turn).  (was: When running a large web crawl it happens that a webadmin requests to immediately stop crawling a certain domain. The default Nutch workflow applies URL filters only to seeds and outlinks. Applying filters during fetch list generation is expensive with a large CrawlDb (fetch lists are usually much shorter).)

> Fetcher to optionally filter URLs
> ---------------------------------
>
>                 Key: NUTCH-2627
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2627
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.16
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.16
>
>
> When running a large web crawl it happens that a webadmin requests to immediately stop crawling a certain domain. The default Nutch workflow applies URL filters only to seeds and outlinks. Applying filters during fetch list generation is expensive with a large CrawlDb (fetch lists are usually much shorter). Allowing the fetcher to optionally filter URLs would make it possible to apply changed filter rules to the next fetcher job launched, even if the segment has already been created (especially if multiple segments are generated in one turn).
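
The idea described above can be illustrated with a minimal, self-contained Java sketch. It does not use the Nutch API: the class name FetchTimeFilterSketch, the method names, the filterEnabled flag and the blocked-domain set are all hypothetical stand-ins for the fetcher option and the configured URL filters. It only shows the principle that an already generated fetch list is re-checked against the current rules right before fetching.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not Nutch code: re-apply filter rules to an existing fetch list.
class FetchTimeFilterSketch {

    // Stand-in for a URL filter: true if the URL's host is a blocked domain
    // or one of its subdomains. Unparsable or host-less URLs are also rejected.
    static boolean isBlocked(String url, Set<String> blockedDomains) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return true;
        }
        if (host == null) {
            return true;
        }
        host = host.toLowerCase(Locale.ROOT);
        for (String domain : blockedDomains) {
            if (host.equals(domain) || host.endsWith("." + domain)) {
                return true;
            }
        }
        return false;
    }

    // With filtering disabled the fetch list is used as generated;
    // with filtering enabled, URLs rejected by the current rules are skipped.
    static List<String> selectUrlsToFetch(List<String> fetchList,
                                          boolean filterEnabled,
                                          Set<String> blockedDomains) {
        List<String> selected = new ArrayList<>();
        for (String url : fetchList) {
            if (!filterEnabled || !isBlocked(url, blockedDomains)) {
                selected.add(url);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // The segment (fetch list) was generated before the rule change.
        List<String> fetchList = List.of(
                "https://example.com/a",
                "https://news.example.org/b",
                "https://www.example.net/c");
        // Rule added afterwards, on the webadmin's request.
        Set<String> blockedDomains = Set.of("example.org");

        System.out.println(selectUrlsToFetch(fetchList, false, blockedDomains));
        System.out.println(selectUrlsToFetch(fetchList, true, blockedDomains));
    }
}
```

With the flag off, all three URLs are kept; with it on, the example.org URL is dropped without regenerating the segment. Since fetch lists are usually much shorter than the CrawlDb, this check touches far fewer URLs than filtering during fetch list generation would.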


