Posted to dev@nutch.apache.org by "Sami Siren (JIRA)" <ji...@apache.org> on 2006/11/10 17:44:39 UTC

[jira] Commented: (NUTCH-395) Increase fetching speed

    [ http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12448795 ] 
            
Sami Siren commented on NUTCH-395:
----------------------------------

>>have you measured what made the biggest impact on performance - changes to Metadata, or
>>changes to IO in FetcherOutput?
>I have not had time yet; I would guess that the IO changes make up the most significant part.

After more digging, my initial guess might not have been correct. Without touching IO at all,
I am able to get the same improvement on trunk (compared to the nightly builds) as
I reported before on the 0.8 branch.

This is good, because we don't need to change file formats at all.
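
For readers unfamiliar with the Metadata change described in the issue text quoted below, here is a minimal sketch of the idea: a plain store that does exact-match lookups, plus an optional decorator that normalizes names only where that is wanted (for example where HTTP headers are handled). All names here (HeaderStore, PlainHeaderStore, NormalizingHeaderStore) are hypothetical and are not the classes in the patch; the normalization is a simplified stand-in for real spellchecking, and the code is written in modern Java rather than what 0.8 targets.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical interface; NOT the Nutch Metadata API.
interface HeaderStore {
    void add(String name, String value);
    List<String> getValues(String name);
}

// Fast path: exact-match names, no per-call normalization cost.
final class PlainHeaderStore implements HeaderStore {
    private final Map<String, List<String>> values = new HashMap<>();

    @Override
    public void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    @Override
    public List<String> getValues(String name) {
        return values.getOrDefault(name, new ArrayList<>());
    }
}

// Decorator: maps sloppy names onto known canonical names, then delegates.
// Assumption: stripping punctuation and lowercasing stands in for spellchecking.
final class NormalizingHeaderStore implements HeaderStore {
    private final HeaderStore delegate;
    private final Map<String, String> canonical = new HashMap<>();

    NormalizingHeaderStore(HeaderStore delegate, String... knownNames) {
        this.delegate = delegate;
        for (String known : knownNames) {
            canonical.put(key(known), known);
        }
    }

    // "content_type" and "Content-Type" both reduce to "contenttype".
    private static String key(String name) {
        return name.replaceAll("[^A-Za-z0-9]", "").toLowerCase(Locale.ROOT);
    }

    // Unknown names pass through unchanged.
    private String normalize(String name) {
        return canonical.getOrDefault(key(name), name);
    }

    @Override
    public void add(String name, String value) {
        delegate.add(normalize(name), value);
    }

    @Override
    public List<String> getValues(String name) {
        return delegate.getValues(normalize(name));
    }
}
```

With this split, code that only moves data around would use PlainHeaderStore directly, and only the protocol layer would wrap it, e.g. new NormalizingHeaderStore(new PlainHeaderStore(), "Content-Type").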



> Increase fetching speed
> -----------------------
>
>                 Key: NUTCH-395
>                 URL: http://issues.apache.org/jira/browse/NUTCH-395
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.8.1
>            Reporter: Sami Siren
>         Assigned To: Sami Siren
>         Attachments: nutch-0.8-performance.txt
>
>
> There has been some discussion on the Nutch mailing lists about the fetcher being slow; this patch tries to address that. The patch is just a quick hack and needs some cleaning up. It currently applies to the 0.8 branch and not trunk, and it has not been tested at large scale. What does it change?
> Metadata - the original Metadata uses spellchecking; the new version does not (a decorator is provided that can do it, and it should perhaps be used where HTTP headers are handled, but in most cases the functionality is not required).
> Reading/writing various data structures - the patch tries to do IO more efficiently; see the patch for details.
> Initial benchmark:
> A small benchmark was done to measure the performance of the changes, using a script that basically does the following:
> -inject a list of urls into a fresh crawldb
> -create fetchlist (10k urls pointing to local filesystem)
> -fetch
> -updatedb
> original code from 0.8-branch:
> real    10m51.907s
> user    10m9.914s
> sys     0m21.285s
> after applying the patch:
> real    4m15.313s
> user    3m42.598s
> sys     0m18.485s
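
For reference, the wall-clock ("real") figures above work out to roughly a 2.5x speedup. The arithmetic can be sketched as follows; the class and method names (TimingComparison, toSeconds, speedup) are made up for illustration, and the parser assumes durations always have the form <minutes>m<seconds>s, as in the figures above.

```java
import java.util.Locale;

// Hypothetical helper for comparing two timings; not part of Nutch.
final class TimingComparison {

    // Parses a duration such as "10m51.907s" into seconds.
    // Assumes the "<minutes>m<seconds>s" form; anything else will throw.
    static double toSeconds(String duration) {
        int minuteMark = duration.indexOf('m');
        int secondMark = duration.lastIndexOf('s');
        double minutes = Double.parseDouble(duration.substring(0, minuteMark));
        double seconds = Double.parseDouble(duration.substring(minuteMark + 1, secondMark));
        return minutes * 60.0 + seconds;
    }

    // Ratio of the "before" time to the "after" time; above 1.0 means faster.
    static double speedup(String before, String after) {
        return toSeconds(before) / toSeconds(after);
    }

    public static void main(String[] args) {
        // Figures copied from the benchmark above.
        System.out.printf(Locale.ROOT, "real: %.2fx%n", speedup("10m51.907s", "4m15.313s"));
        System.out.printf(Locale.ROOT, "user: %.2fx%n", speedup("10m9.914s", "3m42.598s"));
    }
}
```

Note that sys time barely moves between the two runs (0m21.285s vs 0m18.485s); nearly all of the saving is in user time.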
