Posted to dev@nutch.apache.org by "Ferdy Galema (JIRA)" <ji...@apache.org> on 2012/11/09 11:32:13 UTC

[jira] [Commented] (NUTCH-1370) Expose exact number of urls injected @runtime

    [ https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493885#comment-13493885 ] 

Ferdy Galema commented on NUTCH-1370:
-------------------------------------

Hi,

I checked the patch; it seems you are simply grabbing the standard counters for this purpose. Did you know you can write out all counters with LOG.info(currentJob.getCounters())?

The best approach for this issue is to create a custom counter. For every URL injected (every context.write() call), you call something like:
context.getCounter("injector", "urls_injected").increment(1);

Outputting this count at the end of the job is then trivial. You could output all counters (as stated above) or only the "injector" group, whichever you prefer.
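
A minimal sketch of that counting pattern, for illustration only. To keep it runnable without a Hadoop classpath, it uses a small hypothetical Counters class as a stand-in for the job's counters; the class name InjectorCounterSketch, the inject() helper and the rule for skipping blank and '#' lines are assumptions made for the example, not code from the patch. In the actual job the increment would be the context.getCounter("injector", "urls_injected").increment(1) call next to context.write().

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InjectorCounterSketch {

    /** Hypothetical stand-in for the job's counters: group -> counter name -> value. */
    static final class Counters {
        private final Map<String, Map<String, Long>> groups = new LinkedHashMap<>();

        void increment(String group, String name, long amount) {
            groups.computeIfAbsent(group, g -> new LinkedHashMap<>())
                  .merge(name, amount, Long::sum);
        }

        long get(String group, String name) {
            Map<String, Long> counters = groups.get(group);
            return counters == null ? 0L : counters.getOrDefault(name, 0L);
        }
    }

    /** Mimics the map step: exactly one increment per record that is written. */
    static Counters inject(List<String> seedLines) {
        Counters counters = new Counters();
        for (String line : seedLines) {
            String url = line.trim();
            // Assumed filtering rule for this sketch: blank and '#' lines are skipped.
            if (url.isEmpty() || url.startsWith("#")) {
                continue; // nothing is written, so nothing is counted
            }
            // In the real job this sits right after context.write(...):
            // context.getCounter("injector", "urls_injected").increment(1);
            counters.increment("injector", "urls_injected", 1);
        }
        return counters;
    }

    public static void main(String[] args) {
        Counters counters = inject(Arrays.asList(
            "http://nutch.apache.org/", "", "# a comment", "http://example.com/"));
        // Report only the custom counter at the end of the "job".
        System.out.println("Injector: Injected "
            + counters.get("injector", "urls_injected") + " urls");
    }
}
```

Because the increment sits next to the write, the reported number counts the URLs actually written rather than the lines read from the seed files.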
                
> Expose exact number of urls injected @runtime 
> ----------------------------------------------
>
>                 Key: NUTCH-1370
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1370
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.6, 2.2
>
>         Attachments: NUTCH-1370-2.x.patch
>
>
> Example: When using trunk, currently we see 
> {code}
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: starting at 2012-05-22 09:04:00
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: urlDir: urls
> 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
> 2012-05-22 09:04:00,955 INFO  plugin.PluginRepository - Plugins: looking in:
> {code}
> I would like to see
> {code}
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: starting at 2012-05-22 09:04:00
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: urlDir: urls
> 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Injected N urls to crawl/crawldb
> 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
> 2012-05-22 09:04:00,955 INFO  plugin.PluginRepository - Plugins: looking in:
> {code}
> This would make debugging easier and would help those who end up getting 
> {code}
> 2012-05-22 09:04:04,850 WARN  crawl.Generator - Generator: 0 records selected for fetching, exiting ...
> {code}
