You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2017/07/28 10:07:02 UTC

[jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages

    [ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104748#comment-16104748 ] 

Sebastian Nagel commented on NUTCH-1932:
----------------------------------------

My suggestion would be to add a method {{orphanedScore(Text url, CrawlDatum datum)}} to the ScoringFilter interface and add a do-nothing default implementation to AbstractScoringFilter - this makes clear that the method is called when there is no "new" datum from a fetch (successful or not) and no new inlinks.

> Automatically remove orphaned pages
> -----------------------------------
>
>                 Key: NUTCH-1932
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1932
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>         Attachments: NUTCH-1932-add.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch
>
>
> Orphan scoring filter that determines whether a page has become orphaned, e.g. it has no more other pages linking to it. If a page hasn't been linked to after markGoneAfter seconds, the page is marked as gone and is then removed by an indexer.  If a page hasn't been linked to after markOrphanAfter seconds, the page is removed from the CrawlDB.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)