Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2015/03/11 11:52:38 UTC

[jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages

    [ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356711#comment-14356711 ] 

Markus Jelsma commented on NUTCH-1932:
--------------------------------------

Hm yes, I've thought about using a scoring filter too. However, we do need some code in CrawlDbReducer.reduce() because in the end we want to completely remove the record from the CrawlDB. A work-around, maybe not elegant but useful, would be to pass the CrawlDatum to URL filtering and normalizing.
We have some other Nutch jobs that would benefit from having a method signature like normalize(String url, CrawlDatum datum, String scope); the same is true for filter.
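
To make the idea concrete, here is a rough sketch of what datum-aware filtering could look like. It is not taken from the attached patch and does not use the real Nutch classes: DatumSketch is a stand-in for CrawlDatum, its inlinkCount field is hypothetical, and the two interfaces are only an illustration of the signatures mentioned above. It keeps the existing URLFilter convention that returning null rejects the URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Stand-in for CrawlDatum: NOT the real Nutch class, only what this sketch needs.
final class DatumSketch {
    enum Status { FETCHED, GONE }

    final Status status;
    final int inlinkCount; // hypothetical field, assumed to be available on the datum

    DatumSketch(Status status, int inlinkCount) {
        this.status = status;
        this.inlinkCount = inlinkCount;
    }
}

// Hypothetical datum-aware filter; null means "reject", as with the existing URL filters.
interface DatumAwareUrlFilter {
    String filter(String url, DatumSketch datum);
}

// Hypothetical datum-aware normalizer using the signature proposed in the comment.
interface DatumAwareUrlNormalizer {
    String normalize(String url, DatumSketch datum, String scope);
}

// Rejects records that are gone and have no inlinks; a single inlink is enough to keep one.
final class OrphanFilterSketch implements DatumAwareUrlFilter {
    @Override
    public String filter(String url, DatumSketch datum) {
        boolean orphaned = datum.status == DatumSketch.Status.GONE && datum.inlinkCount < 1;
        return orphaned ? null : url;
    }
}

// Illustrative helper: applies a filter to a set of records and returns the URLs that survive.
final class CrawlDbSketch {
    static List<String> survivors(Map<String, DatumSketch> records, DatumAwareUrlFilter filter) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, DatumSketch> e : records.entrySet()) {
            if (filter.filter(e.getKey(), e.getValue()) != null) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }
}
```

With something along these lines, any job that already runs URL filters could drop orphaned records without extra logic of its own, although the actual removal from the CrawlDB would still have to happen in the reducer.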

> Automatically remove orphaned pages
> -----------------------------------
>
>                 Key: NUTCH-1932
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1932
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.11
>
>         Attachments: NUTCH-1932.patch
>
>
> Nutch should be able to automatically remove orphaned pages such as old 404s, and not continue to revisit them. This requires NUTCH-1913. An inlink count of 1 is enough.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)