Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/07/28 16:05:09 UTC

[jira] [Commented] (NUTCH-1071) Crawldb update to total counts per status

    [ https://issues.apache.org/jira/browse/NUTCH-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072358#comment-13072358 ] 

Markus Jelsma commented on NUTCH-1071:
--------------------------------------

Great work! Very useful indeed.

> Crawldb update to total counts per status
> -----------------------------------------
>
>                 Key: NUTCH-1071
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1071
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.4
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>            Priority: Trivial
>             Fix For: 1.4
>
>         Attachments: NUTCH-1071.patch
>
>
> The reduce phase of the crawldb update outputs all the entries that will be found in the updated crawldb. We can use the counters to summarise the number of URLs per status, which is a bit like the readdb -stats functionality except that it does not require an additional step. 
> This is a useful way of monitoring the progress of a crawl using the Hadoop JobTracker UI.
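
A minimal sketch of the idea in the description above, not the attached patch: each entry the reduce phase emits bumps a counter named after its status, so the per-status totals come for free with the update job. The `Counters` class here is a hypothetical plain-Java stand-in for Hadoop's counter facility (in the old `mapred` API the equivalent call would be `Reporter.incrCounter(group, counter, amount)`), and the group name "CrawlDB status" is an assumption chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class StatusCounterSketch {

    // Hypothetical stand-in for Hadoop counters: group -> counter name -> value.
    static final class Counters {
        private final Map<String, Map<String, Long>> groups = new TreeMap<>();

        void increment(String group, String name, long amount) {
            groups.computeIfAbsent(group, g -> new TreeMap<>())
                  .merge(name, amount, Long::sum);
        }

        Map<String, Long> group(String group) {
            return groups.getOrDefault(group, new TreeMap<>());
        }
    }

    // Simulates the reduce phase: one status name per entry written to the
    // updated crawldb. Returns the per-status totals.
    static Map<String, Long> countByStatus(List<String> emittedStatuses) {
        Counters counters = new Counters();
        for (String status : emittedStatuses) {
            // In the real job the entry would be collected as output here;
            // the counter increment is the only extra work per entry.
            counters.increment("CrawlDB status", status, 1L);
        }
        return counters.group("CrawlDB status");
    }

    public static void main(String[] args) {
        List<String> statuses =
            Arrays.asList("db_fetched", "db_unfetched", "db_fetched", "db_gone");
        // Print one "status<TAB>count" line per status, sorted by status name.
        countByStatus(statuses).forEach((name, count) ->
            System.out.println(name + "\t" + count));
    }
}
```

Because the counters are incremented as a side effect of writing the output, no second pass over the crawldb is needed, which is the difference from readdb -stats that the description points out.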
