Posted to dev@nutch.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/07/27 11:56:00 UTC

[jira] [Commented] (NUTCH-2627) Fetcher to optionally filter URLs

    [ https://issues.apache.org/jira/browse/NUTCH-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559630#comment-16559630 ] 

ASF GitHub Bot commented on NUTCH-2627:
---------------------------------------

sebastian-nagel opened a new pull request #370: NUTCH-2627 Fetcher to optionally filter URLs
URL: https://github.com/apache/nutch/pull/370
 
 
   - filter URLs in QueueFeeder if `fetcher.filter.urls` is true, and normalize them if `fetcher.normalize.urls` is true
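
A minimal sketch of the idea, not the code from the pull request: each URL read from the fetch list is optionally normalized and then optionally filtered before it is queued. The class `FetchListFilterSketch`, the method `feed`, the helpers `stripFragment` and `notBlocked`, and the domain `blocked.example` are all hypothetical names made up for illustration. The two boolean arguments stand in for `fetcher.normalize.urls` and `fetcher.filter.urls`, and applying normalization before filtering is an assumption.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; this is not Nutch's QueueFeeder.
class FetchListFilterSketch {

  // Example normalizer (assumption): drop the fragment part of the URL.
  static String stripFragment(String url) {
    int i = url.indexOf('#');
    return i < 0 ? url : url.substring(0, i);
  }

  // Example filter (assumption): reject one domain the crawler was asked to stop fetching.
  static boolean notBlocked(String url) {
    String host = URI.create(url).getHost();
    return host != null
        && !host.equals("blocked.example")
        && !host.endsWith(".blocked.example");
  }

  // Walks the fetch list once; both steps are skipped when their flag is false.
  static List<String> feed(List<String> fetchList, boolean normalize, boolean filter,
      UnaryOperator<String> normalizer, Predicate<String> accept) {
    List<String> queued = new ArrayList<>();
    for (String url : fetchList) {
      String u = normalize ? normalizer.apply(url) : url;
      if (filter && !accept.test(u)) {
        continue; // rejected by the filter: never reaches the fetch queues
      }
      queued.add(u);
    }
    return queued;
  }

  public static void main(String[] args) {
    List<String> fetchList = List.of(
        "https://ok.example/a#frag",
        "https://blocked.example/b");
    System.out.println(feed(fetchList, true, true,
        FetchListFilterSketch::stripFragment, FetchListFilterSketch::notBlocked));
  }
}
```

With both flags false the fetch list passes through unchanged, so the extra work is only done when it is explicitly enabled.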

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Fetcher to optionally filter URLs
> ---------------------------------
>
>                 Key: NUTCH-2627
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2627
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.16
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.16
>
>
> When running a large web crawl, it sometimes happens that a webadmin asks us to stop crawling a certain domain immediately. The default Nutch workflow applies URL filters only to seeds and outlinks. Applying filters during fetch list generation is expensive with a large CrawlDb (fetch lists are usually much shorter).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)