Posted to dev@nutch.apache.org by "Jeroen van Vianen (JIRA)" <ji...@apache.org> on 2010/06/30 23:32:53 UTC

[jira] Updated: (NUTCH-838) Add timing information to all Tool classes

     [ https://issues.apache.org/jira/browse/NUTCH-838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeroen van Vianen updated NUTCH-838:
------------------------------------

    Attachment: timings.patch

Here's the patch to add timings to all Tool classes.

Additionally, it removes some @Override annotations that were used incorrectly and adds the ability to use '#' to mark a line as a comment when injecting new URLs.
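
For illustration only, the '#' comment handling could look roughly like the sketch below. This is not the code from timings.patch: the class SeedLines and the method withoutComments are hypothetical names, and the sketch assumes that only lines whose first non-blank character is '#' count as comments, so a '#' inside a URL (a fragment) is left alone.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the Injector code from timings.patch.
final class SeedLines {
    private SeedLines() {}

    // Returns the trimmed seed lines, skipping blank lines and lines
    // whose first non-blank character is '#'.
    static List<String> withoutComments(List<String> lines) {
        List<String> result = new ArrayList<String>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // blank line or comment: nothing to inject
            }
            result.add(trimmed);
        }
        return result;
    }
}
```

Checking only the start of the trimmed line keeps URLs such as http://example.org/a#frag intact.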

> Add timing information to all Tool classes
> ------------------------------------------
>
>                 Key: NUTCH-838
>                 URL: https://issues.apache.org/jira/browse/NUTCH-838
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, generator, indexer, linkdb, parser
>    Affects Versions: 1.1
>         Environment: JDK 1.6, Linux & Windows
>            Reporter: Jeroen van Vianen
>             Fix For: 2.0
>
>         Attachments: timings.patch
>
>
> I am happily trying to crawl a few hundred URLs incrementally. Performance degraded suddenly once the index reached approximately 25000 URLs.
> At first each batch of inject, (generate, fetch, parse, updatedb) * 3, invertlinks, solrindex, solrdedup took approximately half an hour with topN 500, but elapsed times now increase with every batch: 00h45m, 01h15m, 01h30m. As I'm uncertain which of the phases takes so much time, I decided to add start and finish times to all classes that implement Tool, so that I at least get a feel for it and can review the times in a log file.
> I am using pretty old hardware, but I am planning to recrawl these URLs on a regular basis, and if every iteration is going to take more and more time, index updates will be few and far between :-(
> I added timing information to *all* Tool classes for consistency, although only 10 or so Tools are really interesting.
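
A minimal sketch of the start and finish logging described in the issue, assuming each phase can be wrapped as a job that returns an exit code. It does not reproduce timings.patch and does not depend on Hadoop: ToolTimer, ExitCodeJob, timed and formatElapsed are hypothetical names, and the date and elapsed-time formats are assumptions.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical stand-in for one phase that returns an exit code.
interface ExitCodeJob {
    int run();
}

// Hypothetical helper, not taken from timings.patch.
final class ToolTimer {
    private ToolTimer() {}

    // Formats a duration in milliseconds as HH:mm:ss.
    static String formatElapsed(long millis) {
        long seconds = millis / 1000;
        return String.format("%02d:%02d:%02d",
                seconds / 3600, (seconds % 3600) / 60, seconds % 60);
    }

    // Runs the job, printing a start line before it and a finish line
    // with the elapsed time after it, and returns the job's exit code.
    static int timed(String name, ExitCodeJob job) {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        long start = System.currentTimeMillis();
        System.out.println(name + ": starting at " + sdf.format(new Date(start)));
        try {
            return job.run();
        } finally {
            long end = System.currentTimeMillis();
            System.out.println(name + ": finished at " + sdf.format(new Date(end))
                    + ", elapsed: " + formatElapsed(end - start));
        }
    }
}
```

Printing the finish line in a finally block means the elapsed time still appears in the log when a phase fails, which is when it is most likely to be needed.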

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.