Posted to dev@nutch.apache.org by "Sami Siren (JIRA)" <ji...@apache.org> on 2006/10/29 21:41:17 UTC

[jira] Created: (NUTCH-395) Increase fetching speed

Increase fetching speed
-----------------------

                 Key: NUTCH-395
                 URL: http://issues.apache.org/jira/browse/NUTCH-395
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 0.8.1
            Reporter: Sami Siren
         Assigned To: Sami Siren


There has been some discussion on the Nutch mailing lists about the fetcher being slow; this patch tries to address that. The patch is just a quick hack and needs some cleaning up. It currently applies to the 0.8 branch and not trunk, and it has not been tested at large scale. What does it change?

Metadata - the original metadata uses spellchecking; the new version does not (a decorator is provided that can do it, and it should perhaps be used where HTTP headers are handled, but in most cases the functionality is not required).

Reading/writing various data structures - the patch tries to do I/O more efficiently; see the patch for details.

Initial benchmark:

A small benchmark was done to measure the performance of the changes, using a script that basically does the following:
- inject a list of URLs into a fresh crawldb
- create a fetchlist (10k URLs pointing to the local filesystem)
- fetch
- updatedb

original code from 0.8-branch:
real    10m51.907s
user    10m9.914s
sys     0m21.285s

after applying the patch:
real    4m15.313s
user    3m42.598s
sys     0m18.485s
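
To put the two timings side by side, the `time` values above can be converted to seconds and divided. This is a minimal Python sketch for illustration only; the helper names `parse_time` and `speedups` are made up here and are not part of the patch or of Nutch.

```python
import re

def parse_time(value):
    """Convert a shell `time` value such as '10m51.907s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError("unexpected time format: " + value)
    return int(match.group(1)) * 60 + float(match.group(2))

# Timings copied from the benchmark above.
before = {"real": "10m51.907s", "user": "10m9.914s", "sys": "0m21.285s"}
after = {"real": "4m15.313s", "user": "3m42.598s", "sys": "0m18.485s"}

def speedups(before, after):
    """Return before/after ratios per timing category."""
    return {key: parse_time(before[key]) / parse_time(after[key]) for key in before}

if __name__ == "__main__":
    for key, ratio in speedups(before, after).items():
        print(key, round(ratio, 2))
```

On these numbers the wall-clock ("real") time drops by a factor of roughly 2.5, with most of the gain in user CPU time and very little in system time.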



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

Re: [jira] Resolved: (NUTCH-395) Increase fetching speed

Posted by Sami Siren <ss...@gmail.com>.

Which version are you "upgrading" from? I guess pre-rev. 464654?

If so, see [1] for additional info.

--
  Sami Siren

[1] http://wiki.apache.org/nutch/Upgrading_from_0%2e8%2ex_to_0%2e9

AJ Chen wrote:
> Sami,
> Thanks for resolving this serious issue.  I just updated my code from trunk
> and plan to test fetch speed. But there is a runtime error related to
> switching from UTF8 to Text. Since the error is from Hadoop, how do I fix
> it?
> 
> java.lang.ClassCastException: org.apache.hadoop.io.UTF8
>    at org.apache.nutch.crawl.Generator$Selector.map(Generator.java:108)
>    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:213)
>    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:105)
> 
> Thanks,
> AJ

Re: [jira] Resolved: (NUTCH-395) Increase fetching speed

Posted by AJ Chen <ca...@gmail.com>.

Sami,
Thanks for resolving this serious issue.  I just updated my code from trunk
and plan to test fetch speed. But there is a runtime error related to
switching from UTF8 to Text. Since the error is from Hadoop, how do I fix
it?

java.lang.ClassCastException: org.apache.hadoop.io.UTF8
    at org.apache.nutch.crawl.Generator$Selector.map(Generator.java:108)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:213)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:105)

Thanks,
AJ


On 11/13/06, Sami Siren (JIRA) <ji...@apache.org> wrote:
>
>      [ http://issues.apache.org/jira/browse/NUTCH-395?page=all ]
>
> Sami Siren resolved NUTCH-395.
> ------------------------------
>
>     Fix Version/s: 0.9.0
>        Resolution: Fixed
>
> applied to trunk with some additional whitespace changes.
>
> > Increase fetching speed
> > -----------------------
> >
> >                 Key: NUTCH-395
> >                 URL: http://issues.apache.org/jira/browse/NUTCH-395
> >             Project: Nutch
> >          Issue Type: Improvement
> >          Components: fetcher
> >    Affects Versions: 0.8.1, 0.9.0
> >            Reporter: Sami Siren
> >         Assigned To: Sami Siren
> >             Fix For: 0.9.0
> >
> >         Attachments: nutch-0.8-performance.txt,
> >                      NUTCH-395-trunk-metadata-only-2.patch,
> >                      NUTCH-395-trunk-metadata-only.patch


-- 
AJ Chen, PhD
http://web2express.org

Re: [jira] Commented: (NUTCH-395) Increase fetching speed

Posted by AJ Chen <ca...@gmail.com>.

Linux box, Opteron 2 GHz, 2 GB RAM, DSL download bandwidth up to 5 Mbps.

This is a new crawldb, crawling on 4000 selected sites, total ~1 million
pages fetched after last run.

use default regex-urlfilter.txt except for:
-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|lha|md5|mov|mp3|mp4|mpg|msi|ogg|png|pps|ppt|ps|psd|ram|ris|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xls|z|zip)\)?$
-[*!@#]
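
For readers unfamiliar with the filter file, here is a rough Python sketch of how the two rules above would behave. It assumes the usual regex-urlfilter semantics as I understand them (rules are tried in file order, the first pattern found in the URL decides, `-` rejects, `+` accepts, and a URL matching no rule is rejected). The final `+.` catch-all rule is an assumption about the rest of the default file, and `accepts` is a made-up helper, not a Nutch API.

```python
import re

# The suffix rule from the message above, split only for readability.
SUFFIX_RULE = (
    r"(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|"
    r"ico|iso|jar|java|jnlp|jpeg|jpg|lha|md5|mov|mp3|mp4|mpg|msi|ogg|png|pps|ppt|ps|psd|"
    r"ram|ris|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xls|z|zip)\)?$"
)

# (sign, pattern) pairs in file order.
RULES = [
    ("-", SUFFIX_RULE),
    ("-", r"[*!@#]"),
    ("+", r"."),  # assumption: catch-all accept rule at the end of the file
]
COMPILED = [(sign, re.compile(pattern)) for sign, pattern in RULES]

def accepts(url):
    """Return True if the first rule found in the URL is an accept rule."""
    for sign, pattern in COMPILED:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: treat as rejected

if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/logo.PNG",
                "http://example.com/page#top"):
        print(url, accepts(url))
```

The example URLs are placeholders. The point is only that binary and media suffixes, and URLs containing `*`, `!`, `@` or `#`, are dropped before fetching.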

additional filter to limit urls to the selected domains  (hashtable
implementation)
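
The message does not show the domain filter itself. The following is only a guess at what a hash-based domain whitelist could look like, written in Python for brevity; the domain names and the function `in_selected_domains` are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical whitelist; the real crawl used about 4000 selected sites.
ALLOWED_DOMAINS = {"example.org", "example.com"}

def in_selected_domains(url, allowed=ALLOWED_DOMAINS):
    """Return True if the URL's host, or any parent domain of it, is whitelisted."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    # www.example.org -> check "www.example.org", "example.org", "org"
    return any(".".join(labels[i:]) in allowed for i in range(len(labels)))

if __name__ == "__main__":
    print(in_selected_domains("http://www.example.org/page"))
    print(in_selected_domains("http://example.net/"))
```

Each lookup is a constant-time set membership test per host label, so a filter of this shape should add little cost per URL compared with the regex rules.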

plugins:
protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic
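
The plugin list above is a regular expression over plugin ids. As a quick way to see which ids it selects, here is a small Python sketch; it assumes the expression has to match the whole plugin id, and `is_enabled` is a made-up helper rather than anything in Nutch.

```python
import re

# Plugin expression copied from the message above.
PLUGIN_INCLUDES = (
    r"protocol-http|urlfilter-regex|parse-(text|html)|index-basic|"
    r"query-(basic|site|url)|summary-basic|scoring-opic"
)

def is_enabled(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if the whole plugin id matches the include expression."""
    return re.fullmatch(includes, plugin_id) is not None

if __name__ == "__main__":
    for plugin_id in ("parse-html", "parse-pdf", "query-site"):
        print(plugin_id, is_enabled(plugin_id))
```

Under that assumption only the text and HTML parsers are active, which agrees with the "parse only html and text" remark in this message.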

use default org.apache.nutch.net.URLNormalizer

thanks for helping,
AJ


parse only html and text

On 11/22/06, Sami Siren <ss...@gmail.com> wrote:
>
> What kind of hardware are you running on? Your pages-per-second rate seems
> very low to me.
>
> How big was your crawldb when you started, and how big was it at the end?
>
> What kind of filters and normalizers are you using?
>
> --
>   Sami Siren


-- 
AJ Chen, PhD
Palo Alto, CA
http://web2express.org

Re: [jira] Commented: (NUTCH-395) Increase fetching speed

Posted by Sami Siren <ss...@gmail.com>.

What kind of hardware are you running on? Your pages-per-second rate seems
very low to me.

How big was your crawldb when you started, and how big was it at the end?

What kind of filters and normalizers are you using?

--
  Sami Siren

AJ Chen wrote:
> I checked out the code from trunk after Sami committed the change. I
> started a new crawl db and ran several crawl cycles sequentially on one
> Linux server. See below for the real numbers from my test. The performance
> is still poor because the crawler still spends too much time in the reduce
> and update operations.
> 
> #crawl cycle: topN=200000
> 2006-11-17 17:25:27,367 INFO  crawl.Generator - Generator: segment:
> crawl/segments/20061117172527
> 2006-11-17 17:47:45,837 INFO  fetcher.Fetcher - Fetcher: segment:
> crawl/segments/20061117172527
> # 8 hours fetching ~200000 pages
> 2006-11-18 03:13:31,992 INFO  mapred.LocalJobRunner - 183644 pages, 5506
> errors, 5.4 pages/s, 1043 kb/s,
> # 4 hours doing "reduce"
> 2006-11-18 07:30:38,085 INFO  crawl.CrawlDb - CrawlDb update: starting
> # 4 hours update db
> 2006-11-18 11:17:54,000 INFO  crawl.CrawlDb - CrawlDb update: done
> 
> #crawl cycle: topN=500,000 pages
> 2006-11-18 13:22:51,530 INFO  crawl.Generator - Generator: segment:
> crawl/segments/20061118132251
> 2006-11-18 14:50:07,006 INFO  fetcher.Fetcher - Fetcher: segment:
> crawl/segments/20061118132251
> # fetching for 16 hours
> 2006-11-19 06:53:34,923 INFO  mapred.LocalJobRunner - 394343 pages, 19050
> errors, 6.8 pages/s, 1439 kb/s,
> # reduce for 11 hours
> 2006-11-19 17:49:15,778 INFO  crawl.CrawlDb - CrawlDb update: segment:
> crawl/segments/20061118132251
> # update db for 10 hours
> 2006-11-20 03:55:22,882 INFO  crawl.CrawlDb - CrawlDb update: done
> 
> #crawl cycle: topN=600,000 pages
> 2006-11-20 08:14:51,463 INFO  crawl.Generator - Generator: segment:
> crawl/segments/20061120081451
> 2006-11-20 11:31:22,384 INFO  fetcher.Fetcher - Fetcher: segment:
> crawl/segments/20061120081451
> #fetching for 18 hours
> 2006-11-21 06:00:08,504 INFO  mapred.LocalJobRunner - 410078 pages, 26316
> errors, 6.2 pages/s, 1257 kb/s,
> #reduce for 11 hours
> 2006-11-21 17:26:38,213 INFO  crawl.CrawlDb - CrawlDb update: starting
> #update for 13 hours
> 2006-11-22 06:25:48,592 INFO  crawl.CrawlDb - CrawlDb update: done
> 
> 
> -AJ
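
The phase lengths annotated in the log above can be recomputed from the timestamps themselves. Below is a minimal Python sketch for the first cycle (topN=200000). The phase names, and the choice to treat each interval between consecutive log lines as one phase, are assumptions that follow the annotations in the log; Nutch does not report these durations directly. The log lines are rejoined after email wrapping.

```python
from datetime import datetime

# Log lines of the first crawl cycle, copied from the quoted message.
LOG = """\
2006-11-17 17:25:27,367 INFO  crawl.Generator - Generator: segment: crawl/segments/20061117172527
2006-11-17 17:47:45,837 INFO  fetcher.Fetcher - Fetcher: segment: crawl/segments/20061117172527
2006-11-18 03:13:31,992 INFO  mapred.LocalJobRunner - 183644 pages, 5506 errors, 5.4 pages/s, 1043 kb/s,
2006-11-18 07:30:38,085 INFO  crawl.CrawlDb - CrawlDb update: starting
2006-11-18 11:17:54,000 INFO  crawl.CrawlDb - CrawlDb update: done
"""

def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' (23 characters) of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")

def phase_hours(log_text):
    """Return hours elapsed between consecutive log lines, keyed by assumed phase name."""
    stamps = [parse_timestamp(line) for line in log_text.splitlines() if line.strip()]
    names = ["generate", "fetch", "reduce", "updatedb"]
    return {
        name: (end - start).total_seconds() / 3600
        for name, start, end in zip(names, stamps, stamps[1:])
    }

if __name__ == "__main__":
    for name, hours in phase_hours(LOG).items():
        print(name, round(hours, 2))
```

By this measure the fetch interval of the first cycle is about 9.4 hours, and 183644 pages over that interval works out to roughly 5.4 pages/s, consistent with the rate printed in the log. The reduce and updatedb intervals together come to about 8 hours, which is the overhead the message complains about.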
> 
> 
> On 11/13/06, Andrzej Bialecki (JIRA) <ji...@apache.org> wrote:
>>
>>     [
>> http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12449292] 
>>
>>
>> Andrzej Bialecki  commented on NUTCH-395:
>> -----------------------------------------
>>
>> +1 - this patch looks good to me - if you could just fix the whitespace
>> issues prior to committing, so that it conforms to the coding style ...