Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2017/07/28 10:07:02 UTC
[jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104748#comment-16104748 ]
Sebastian Nagel commented on NUTCH-1932:
----------------------------------------
My suggestion would be to add a method {{orphanedScore(Text url, CrawlDatum datum)}} to the ScoringFilter interface, with a do-nothing default implementation in AbstractScoringFilter. This makes clear that the method is called when there is no "new" datum from a fetch (successful or not) and there are no new inlinks.
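A minimal sketch of that suggestion, to make the shape of the change concrete. It does not use the real Nutch classes: CrawlDatum, ScoringFilter and AbstractScoringFilter below are simplified stand-ins, the URL is a plain String instead of a Hadoop Text, and the Status enum, NoopFilter and MarkingFilter are hypothetical names invented for illustration. The real interface has more methods and its signatures may differ.

```java
class OrphanedScoreSketch {

  /** Hypothetical status values; the real CrawlDatum uses byte constants. */
  enum Status { FETCHED, ORPHAN }

  /** Stand-in for CrawlDatum, reduced to a single status field (assumption). */
  static class CrawlDatum {
    Status status = Status.FETCHED;
  }

  /** Stand-in for the ScoringFilter interface, reduced to the proposed method. */
  interface ScoringFilter {
    /** Called when there is no new datum from a fetch and no new inlinks. */
    void orphanedScore(String url, CrawlDatum datum);
  }

  /** Stand-in for AbstractScoringFilter: supplies the do-nothing default. */
  static abstract class AbstractScoringFilter implements ScoringFilter {
    @Override
    public void orphanedScore(String url, CrawlDatum datum) {
      // Intentionally empty: existing filters are unaffected by the new method.
    }
  }

  /** A filter that inherits the default and therefore leaves the datum alone. */
  static class NoopFilter extends AbstractScoringFilter { }

  /** A filter that overrides the hook, as an orphan filter might (hypothetical). */
  static class MarkingFilter extends AbstractScoringFilter {
    @Override
    public void orphanedScore(String url, CrawlDatum datum) {
      datum.status = Status.ORPHAN;
    }
  }

  public static void main(String[] args) {
    CrawlDatum a = new CrawlDatum();
    new NoopFilter().orphanedScore("http://example.org/a", a);
    System.out.println(a.status); // prints FETCHED

    CrawlDatum b = new CrawlDatum();
    new MarkingFilter().orphanedScore("http://example.org/b", b);
    System.out.println(b.status); // prints ORPHAN
  }
}
```

With the default living in the abstract base class, only filters that care about orphaned entries need to override the method.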
> Automatically remove orphaned pages
> -----------------------------------
>
> Key: NUTCH-1932
> URL: https://issues.apache.org/jira/browse/NUTCH-1932
> Project: Nutch
> Issue Type: New Feature
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Attachments: NUTCH-1932-add.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch
>
>
> Orphan scoring filter that determines whether a page has become orphaned, i.e. no other pages link to it any more. If a page has not been linked to for markGoneAfter seconds, it is marked as gone and is then removed by an indexer. If a page has not been linked to for markOrphanAfter seconds, it is removed from the CrawlDB.
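
The two thresholds in the description above can be read as a simple decision on the time since a page was last linked to. The sketch below is an illustration under assumptions, not the code from the attached patches: the Decision enum, the decide method and its parameters are hypothetical, all times are in seconds, and markOrphanAfter is assumed to be at least as large as markGoneAfter.

```java
class OrphanDecisionSketch {

  /** Hypothetical outcomes for one CrawlDB entry. */
  enum Decision { KEEP, MARK_GONE, MARK_ORPHAN }

  /**
   * Decide what to do with a page given when it was last seen with an inlink.
   * Assumes markOrphanAfter >= markGoneAfter; all values are in seconds.
   */
  static Decision decide(long lastInlinkSec, long nowSec,
                         long markGoneAfter, long markOrphanAfter) {
    long sinceLastInlink = nowSec - lastInlinkSec;
    if (sinceLastInlink > markOrphanAfter) {
      return Decision.MARK_ORPHAN; // to be removed from the CrawlDB
    }
    if (sinceLastInlink > markGoneAfter) {
      return Decision.MARK_GONE;   // to be removed by an indexer
    }
    return Decision.KEEP;
  }

  public static void main(String[] args) {
    long day = 24L * 60 * 60;
    long gone = 30 * day;   // example value for markGoneAfter
    long orphan = 40 * day; // example value for markOrphanAfter

    System.out.println(decide(0, 10 * day, gone, orphan)); // prints KEEP
    System.out.println(decide(0, 35 * day, gone, orphan)); // prints MARK_GONE
    System.out.println(decide(0, 45 * day, gone, orphan)); // prints MARK_ORPHAN
  }
}
```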