Posted to user@nutch.apache.org by Vasja Ocvirk <va...@vizija.si> on 2006/07/26 16:23:41 UTC

0.8 much slower than 0.7

Hello,

I'm wondering if anyone can help. We injected 1000 seed URLs into Nutch 
0.7.2 (basic configuration + 1000 URLs in the regexp filter) and it 
processed them in just a few hours. We just switched to 0.8 with the same 
configuration and the same URLs, but everything seems to have slowed down 
significantly. The crawl script uses 60 threads -- the same as before -- but 
now it runs much slower.

Thanks!

Best,
Vasja

__________ NOD32 1.1533 (20060512) Information __________

This message was checked by NOD32 antivirus system.
http://www.eset.com





Re: 0.8 much slower than 0.7

Posted by Vasja Ocvirk <va...@vizija.si>.
Here is some more text from the log. It seems that it slows down at
mapred.LocalJobRunner



2006-08-02 10:12:28,160 INFO  mapred.LocalJobRunner - 36 pages, 0 
errors, 0.3 pages/s, 51 kb/s,
2006-08-02 10:12:28,900 DEBUG http.Http - fetching 
http://www.foo.com/internet_aplikacije.php
2006-08-02 10:12:28,918 DEBUG http.Http - fetched 25812 bytes from 
http://www.foo.com/internet_aplikacije.php
2006-08-02 10:12:28,920 DEBUG parse.ParseUtil - Parsing 
[http://www.foo.com/internet_aplikacije.php] with 
[org.apache.nutch.parse.html.HtmlParser@bad8a8]
2006-08-02 10:12:28,920 DEBUG parse.html - 
http://www.foo.com/internet_aplikacije.php: setting encoding to ISO-8859-2
2006-08-02 10:12:28,920 DEBUG parse.html - Parsing...
2006-08-02 10:12:28,932 DEBUG parse.html - Meta tags for 
http://www.foo.com/internet_aplikacije.php: base=null, noCache=false, 
noFollow=false, noIndex=false, refresh=false, refreshHref=null
* general tags:
  - keywords   =       
cms,aplikacija,modul,vodenje,kontaktov,trgovina,upravljanje,voščilnice,anketa,internet 
trgovina,izdelava
  - author     =       Foo
  - description        =       Aplikacija za vodenje kontaktov. CMS - 
Sistem za upravljanje z vsebinami.
  - robots     =       INDEX,FOLLOW
* http-equiv tags:
  - content-type       =       text/html; charset=iso-8859-2

2006-08-02 10:12:28,932 DEBUG parse.html - Getting text...
2006-08-02 10:12:28,938 DEBUG parse.html - Getting title...
2006-08-02 10:12:28,938 DEBUG parse.html - Getting links...
2006-08-02 10:12:28,942 DEBUG parse.html - found 160 outlinks in 
http://www.foo.com/internet_aplikacije.php
2006-08-02 10:12:29,162 INFO  mapred.LocalJobRunner - 37 pages, 0 
errors, 0.3 pages/s, 52 kb/s,
2006-08-02 10:12:30,164 INFO  mapred.LocalJobRunner - 37 pages, 0 
errors, 0.3 pages/s, 52 kb/s,
2006-08-02 10:12:31,166 INFO  mapred.LocalJobRunner - 37 pages, 0 
errors, 0.3 pages/s, 51 kb/s,
2006-08-02 10:12:32,168 INFO  mapred.LocalJobRunner - 37 pages, 0 
errors, 0.3 pages/s, 51 kb/s,
2006-08-02 10:12:33,170 INFO  mapred.LocalJobRunner - 37 pages, 0 
errors, 0.3 pages/s, 50 kb/s,
2006-08-02 10:12:33,918 DEBUG http.Http - fetching 
http://www.foo.com/mediji.php

ATB,
Vasja

Zaheed Haque wrote:
>> One question, though: anyone knows how to set more verbose logging?
>
> You can edit your log4j properties under nutch/conf to enable DEBUG
> mode both for hadoop and nutch.
>
> Cheers
>

Re: 0.8 much slower than 0.7

Posted by Zaheed Haque <za...@gmail.com>.
> One question, though: anyone knows how to set more verbose logging?

You can edit your log4j properties under nutch/conf to enable DEBUG
mode both for hadoop and nutch.

Cheers
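
For example, something along these lines in conf/log4j.properties (a minimal
sketch; the exact logger names and appender setup may differ between Nutch
releases, so check the file shipped with your copy):

```
# conf/log4j.properties -- illustrative fragment only
log4j.logger.org.apache.nutch=DEBUG
log4j.logger.org.apache.hadoop=DEBUG
```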


Re: 0.8 much slower than 0.7

Posted by Vasja Ocvirk <va...@vizija.si>.
We did some analysis of 0.8:

- generate and updatedb work just fine, while fetch is extremely slow. 
It takes 2 seconds to fetch one page, compared to 0.7, which fetched 20 
pages per second.
- during the fetch the box is at 100% CPU (3 GHz Pentium), which is quite odd.
- we checked the log: URL fetching proceeds normally until the 
"crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature"
log entry. After that, fetching slows down.
- we injected only two URLs and also listed both of them in the regexp filter.
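
(Editorial aside: the log below reports fetcher.server.delay = 5000 and shows
successive fetches roughly five seconds apart, so with only two hosts the
politeness delay alone caps throughput, regardless of the 60 fetcher threads.
A back-of-the-envelope check, using values taken from this thread rather than
from Nutch code:)

```python
# Upper bound on fetch throughput imposed by the politeness delay.
# Values come from the log in this message; this is plain arithmetic,
# not Nutch code.
server_delay_ms = 5000   # fetcher.server.delay reported in the log
hosts = 2                # only two URLs (hence at most two hosts) injected

# Politeness allows at most one fetch per host per delay interval,
# no matter how many fetcher threads are running.
max_pages_per_sec = hosts * (1000.0 / server_delay_ms)
print(max_pages_per_sec)  # 0.4 -- consistent with the ~0.3 pages/s logged
```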

Hope this helps someone.

One question, though: does anyone know how to set more verbose logging?

Thanks.

2006-08-01 19:58:37,576 INFO  fetcher.Fetcher - fetching
http://www.foo.com/faq.php
2006-08-01 19:58:37,599 INFO  http.Http - http.proxy.host = null
2006-08-01 19:58:37,599 INFO  http.Http - http.proxy.port = 8080
2006-08-01 19:58:37,599 INFO  http.Http - http.timeout = 10000
2006-08-01 19:58:37,600 INFO  http.Http - http.content.limit = 65536
2006-08-01 19:58:37,600 INFO  http.Http - http.agent = siBot/siBot-0.1
(http://www.foo.com/; info@foo.com)
2006-08-01 19:58:37,600 INFO  http.Http - fetcher.server.delay = 5000
2006-08-01 19:58:37,600 INFO  http.Http - http.max.delays = 100
2006-08-01 19:58:38,103 INFO  crawl.SignatureFactory - Using Signature
impl: org.apache.nutch.crawl.MD5Signature
2006-08-01 19:58:38,145 INFO  fetcher.Fetcher - fetching
http://www.foo.com/izobrazevanje.php
2006-08-01 19:58:43,569 INFO  fetcher.Fetcher - fetching
http://www.foo.com/kontakti.php
2006-08-01 19:58:48,624 INFO  fetcher.Fetcher - fetching
http://www.foo.com/portfolio_mailing.php
2006-08-01 19:58:53,553 INFO  fetcher.Fetcher - fetching
http://www.foo.com/online_katalogi.php
2006-08-01 19:58:58,597 INFO  fetcher.Fetcher - fetching
http://www.foo.com/postavitev_sistemov.php
2006-08-01 19:59:03,592 INFO  fetcher.Fetcher - fetching
http://www.foo.com/internet_aplikacije.php
2006-08-01 19:59:08,655 INFO  fetcher.Fetcher - fetching
http://www.foo.com/gradivo.php

ATB,
Vasja


Re: 0.8 much slower than 0.7

Posted by Stefan Groschupf <sg...@media-style.com>.
Check:
http://issues.apache.org/jira/browse/NUTCH-233
and let us know if it helps.
Stefan




Re: 0.8 much slower than 0.7

Posted by Matthew Holt <mh...@redhat.com>.
Fetcher for one, and the mapreduce takes forever... i.e. the mapreduce is 
kind of annoying... is it possible to disable it if I'm not running on a 
DFS?
Matt

06/07/25 20:59:12 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:14 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:19 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:23 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:29 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:33 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:34 INFO mapred.JobClient:  map 100%  reduce 96%
06/07/25 20:59:40 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:41 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:42 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:47 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:48 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:52 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:53 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 21:00:05 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 21:00:22 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 21:00:29 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 21:00:39 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 21:01:07 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 21:01:08 INFO mapred.JobClient:  map 100%  reduce 97%
06/07/25 21:01:16 INFO mapred.LocalJobRunner: reduce > reduce



fetcher improvements (was: Re: 0.8 much slower than 0.7)

Posted by Sami Siren <ss...@gmail.com>.
Stefan Groschupf wrote:
> Hi,
> I have some code using a queue-based mechanism and Java NIO.
> In my tests it is 4 times faster than the existing fetcher.
> 
> But:
> + I need to fix some more bugs
> + we need to refactor the robots.txt part, since it is not usable
> outside the http protocol yet.

IMO, the code for politeness should also be taken out of http
and made protocol independent.

> + the fetcher does not support pluggable protocols - only http.
> 
> I see two ways to go.
> Refactor the existing robots.txt parser and handler, but this is a big
> change.

We should do the refactoring, because it would greatly benefit the current 
fetcher too if we could schedule fetching of robots.txt before we try 
to get the content itself, e.g. fetch robots.txt for the first 100 sites, 
and after that start fetching content and unseen robots.txt files for sites 
still in the queue (just an example).
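
The scheduling idea above could be sketched roughly like this (a hypothetical
illustration; the function name, batch size, and data shapes are made up, and
real code would live in the fetcher's queue logic):

```python
from collections import OrderedDict

def plan_fetches(urls, prefetch_hosts=100):
    """Order fetches so robots.txt for the first `prefetch_hosts` hosts is
    retrieved up front; later hosts get their robots.txt queued just before
    their first content fetch. Purely illustrative, not Nutch code."""
    # Unique hosts in first-seen order.
    hosts = list(OrderedDict.fromkeys(u.split('/')[2] for u in urls))
    plan = [('robots', h) for h in hosts[:prefetch_hosts]]
    done = set(hosts[:prefetch_hosts])
    for u in urls:
        h = u.split('/')[2]
        if h not in done:            # unseen host: robots.txt first
            plan.append(('robots', h))
            done.add(h)
        plan.append(('fetch', u))
    return plan
```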

> Or, my preference, reimplement robots.txt parsing and handling; this
> requires some more time from me.
> 
> In general we should move this discussion to nutch-dev, since there
> are more side effects we should discuss.

Now we have it here.

> The new fetcher should be an alternative and we should not just  remove 
> the old fetcher.

+1

--
  Sami Siren

Re: 0.8 much slower than 0.7

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi,
I have some code using a queue-based mechanism and Java NIO.
In my tests it is 4 times faster than the existing fetcher.

But:
+ I need to fix some more bugs
+ we need to refactor the robots.txt part, since it is not usable 
outside the http protocol yet.
+ the fetcher does not support pluggable protocols - only http.

I see two ways to go.
Refactor the existing robots.txt parser and handler, but this is a big 
change.
Or, my preference, reimplement robots.txt parsing and handling; this 
requires some more time from me.

In general we should move this discussion to nutch-dev, since there 
are more side effects we should discuss.
The new fetcher should be an alternative; we should not just 
remove the old fetcher.

Stefan
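
(Editorial sketch of the queue-based idea mentioned above -- an illustration
of the general technique, not Stefan's code: one FIFO queue per host, and
fetcher threads that skip hosts still inside their politeness window instead
of sleeping on them.)

```python
import time
from collections import defaultdict, deque

class HostQueues:
    """Per-host FIFO queues with a politeness delay (illustrative sketch)."""

    def __init__(self, delay_secs=5.0):
        self.delay = delay_secs
        self.queues = defaultdict(deque)   # host -> pending URLs
        self.next_ok = {}                  # host -> earliest next-fetch time

    def add(self, url):
        self.queues[url.split('/')[2]].append(url)

    def next_url(self, now=None):
        """Return a URL whose host is fetchable now, or None.
        Threads never sleep on a busy host; they pick another queue."""
        now = time.time() if now is None else now
        for host, q in self.queues.items():
            if q and now >= self.next_ok.get(host, 0.0):
                self.next_ok[host] = now + self.delay
                return q.popleft()
        return None
```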





Re: 0.8 much slower than 0.7

Posted by Sami Siren <ss...@gmail.com>.
Are you experiencing slowness in general, or just in some parts of the 
process?

The current fetcher is dead slow and should be given immediate attention. 
There has been some talk about the issue, but I haven't seen any code yet.

--
  Sami Siren



Re: 0.8 much slower than 0.7

Posted by Matthew Holt <mh...@redhat.com>.
I agree. Is there any way to disable something to speed it up? I.e., is the 
map reduce currently needed if we're not on a DFS?

Matt
