Posted to dev@nutch.apache.org by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org> on 2011/12/29 15:39:30 UTC

[jira] [Updated] (NUTCH-1239) Webgraph should remove deleted pages from segment input

     [ https://issues.apache.org/jira/browse/NUTCH-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1239:
---------------------------------

    Attachment: NUTCH-1239-1.5-1.patch

Patch for 1.5. A little review would be appreciated. I added a BooleanWritable(false) for keys that no longer exist, based on their CrawlDatum.status. If the reducer picks up that value along with the LinkDatum objects, it rejects the entire key because the page is gone.
I needed a GenericWritable and, luckily, we already have NutchWritable, hence the addition of those two new classes.
I tested it and it seems to work.
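
For reviewers, here is a rough sketch of the reducer-side idea in plain Java. It is not the code from the patch: Link and GoneMarker are hypothetical stand-ins for LinkDatum and the BooleanWritable(false) marker, and the reduce method and class name are made up for illustration. It assumes the values for one key arrive as a mixed collection, the way a GenericWritable wrapper would deliver them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GoneMarkerSketch {

    /** Hypothetical stand-in for a link record (the real job uses LinkDatum). */
    static final class Link {
        final String targetUrl;

        Link(String targetUrl) {
            this.targetUrl = targetUrl;
        }
    }

    /** Hypothetical stand-in for the BooleanWritable(false) "page is gone" marker. */
    static final class GoneMarker {
    }

    /**
     * Reduce step for a single key (a source URL). The values mix Link
     * objects with, possibly, a GoneMarker. If a marker is present the
     * whole key is rejected, so no outlinks are emitted for it.
     */
    static List<String> reduce(List<Object> values) {
        List<String> outlinks = new ArrayList<>();
        for (Object value : values) {
            if (value instanceof GoneMarker) {
                // The page no longer exists: drop every link collected for this key.
                return new ArrayList<>();
            }
            if (value instanceof Link) {
                outlinks.add(((Link) value).targetUrl);
            }
        }
        return outlinks;
    }

    public static void main(String[] args) {
        List<Object> live = Arrays.asList(
                new Link("http://a.example/"), new Link("http://b.example/"));
        List<Object> gone = Arrays.asList(
                new Link("http://a.example/"), new GoneMarker());

        System.out.println("live key outlinks: " + reduce(live).size());
        System.out.println("gone key outlinks: " + reduce(gone).size());
    }
}
```

The marker can arrive in any position among the values, so the sketch checks every value and discards whatever it had already collected once it sees one.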
                
> Webgraph should remove deleted pages from segment input
> -------------------------------------------------------
>
>                 Key: NUTCH-1239
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1239
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>         Attachments: NUTCH-1239-1.5-1.patch
>
>
> Webgraph's outlink job is currently unable to remove links. It should expand its segment input and be able to remove nodes for pages that no longer exist.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira