Posted to user@nutch.apache.org by Byron Miller <by...@yahoo.com> on 2006/02/22 18:09:32 UTC

.8 svn - fetcher performance..

Is there anything I should change/tweak in my fetcher config
for the .8 release? I'm only getting 5 pages/sec, and I was
getting nearly 50 on .7 with 125 threads going.  Does
.8 not use threads like .7 did?

I believe I'm just using the standard protocol-http
support, not http-client.
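
(For context: which protocol plugin the fetcher uses is controlled by the plugin.includes property in nutch-site.xml. The sketch below is only illustrative -- the value list is abbreviated, and swapping protocol-http for protocol-httpclient is roughly what switching to the http-client based plugin would look like.)

---
<!-- nutch-site.xml sketch; abbreviated, illustrative value list -->
<property>
  <name>plugin.includes</name>
  <!-- replace protocol-http with protocol-httpclient to try the
       http-client based protocol plugin instead -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
---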

Re: .8 svn - fetcher performance..

Posted by Doug Cutting <cu...@apache.org>.
Byron Miller wrote:
> Is there anything I should change/tweak in my fetcher config
> for the .8 release? I'm only getting 5 pages/sec, and I was
> getting nearly 50 on .7 with 125 threads going.  Does
> .8 not use threads like .7 did?

Byron,

Have you tried again more recently?  A number of bugs have been fixed in 
0.8 in the past few weeks.  I think it is now much more stable.

Doug

Re: .8 svn - fetcher performance..

Posted by Zaheed Haque <za...@gmail.com>.
Ken:

Thank you very much for the info. I applied it in my testing environment
and I could see big changes in my bandwidth utilization. I have tried
it on a simple server and I could get a rather constant 25-29
pages/sec in a vertical crawl. Previously I was getting about 5-7
pages/sec.

Cheers
Zaheed


On 7/11/06, Ken Krugler <kk...@transpac.com> wrote:
> >On 6/28/06, Ken Krugler <kk...@transpac.com> wrote:
> >>Hi Doug,
> >>
> >>>Did you ever resolve your 0.8 vs 0.7 crawling performance question? I'm
> >>>running into a similar problem.
> >>
> >>We wound up dramatically increasing the number of threads, which
> >>seemed to help solve the bandwidth utilization problem. With Nutch
> >>0.7 we were running about 200 threads per crawler, and with Nutch 0.8
> >>it's more like 2000+ threads...though you have to reduce the thread
> >>stack size in this type of configuration.
> >
> >Hi Ken
> >
> >Could you please give me some clue regarding the stack size at which
> >you are seeing the best bandwidth utilization...
>
> Note that stack size twiddling is only done to allow for increasing
> the number of fetcher threads without running out of JVM or OS memory.
>
> >  I have the following
> >
> >core file size          (blocks, -c) 0
> >data seg size           (kbytes, -d) unlimited
> >max nice                        (-e) 20
> >file size               (blocks, -f) unlimited
> >pending signals                 (-i) unlimited
> >max locked memory       (kbytes, -l) unlimited
> >max memory size         (kbytes, -m) unlimited
> >open files                      (-n) 1024
> >pipe size            (512 bytes, -p) 8
> >POSIX message queues     (bytes, -q) unlimited
> >max rt priority                 (-r) unlimited
> >stack size              (kbytes, -s) 8192
> >cpu time               (seconds, -t) unlimited
> >max user processes              (-u) unlimited
> >virtual memory          (kbytes, -v) unlimited
> >file locks                      (-x) unlimited
> >
> >What stack size should I play with? The default seems to be 8192 KB.
>
> We use something like ulimit -s 512 to set a 512K stack size at the OS level.
>
> >Also, are there any other parameters I should tweak?
>
> We specify -Xss512K when running the fetch map-reduce task to set the
> stack size in the JVM. But I don't remember off the top of my head
> which of the many different config files this gets set in. Stefan?
> >
> >I often get a "too many open
> >files" problem
>
> That's a separate issue.
>
> >and I never could use my full bandwidth; I am using
> >about 10% of it. I have played around with ulimit -n "very
> >high number", which solves the "too many open files" error, but it's
> >still not utilizing all my bandwidth. Any help will be very much appreciated.
>
> Try increasing the number of fetcher threads and reducing the stack
> size. With 10 high-end servers in a cluster, we were able to max out
> a 100 Mbps connection for brief periods, though as our crawl converged
> (because it's a vertical crawl) the max rate eventually drops to
> about 50 Mbps.
>
> -- Ken
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "Find Code, Find Answers"
>

Re: .8 svn - fetcher performance..

Posted by Ken Krugler <kk...@transpac.com>.
>On 6/28/06, Ken Krugler <kk...@transpac.com> wrote:
>>Hi Doug,
>>
>>>Did you ever resolve your 0.8 vs 0.7 crawling performance question? I'm
>>>running into a similar problem.
>>
>>We wound up dramatically increasing the number of threads, which
>>seemed to help solve the bandwidth utilization problem. With Nutch
>>0.7 we were running about 200 threads per crawler, and with Nutch 0.8
>>it's more like 2000+ threads...though you have to reduce the thread
>>stack size in this type of configuration.
>
>Hi Ken
>
>Could you please give me some clue regarding the stack size at which
>you are seeing the best bandwidth utilization...

Note that stack size twiddling is only done to allow for increasing 
the number of fetcher threads without running out of JVM or OS memory.

>  I have the following
>
>core file size          (blocks, -c) 0
>data seg size           (kbytes, -d) unlimited
>max nice                        (-e) 20
>file size               (blocks, -f) unlimited
>pending signals                 (-i) unlimited
>max locked memory       (kbytes, -l) unlimited
>max memory size         (kbytes, -m) unlimited
>open files                      (-n) 1024
>pipe size            (512 bytes, -p) 8
>POSIX message queues     (bytes, -q) unlimited
>max rt priority                 (-r) unlimited
>stack size              (kbytes, -s) 8192
>cpu time               (seconds, -t) unlimited
>max user processes              (-u) unlimited
>virtual memory          (kbytes, -v) unlimited
>file locks                      (-x) unlimited
>
>What stack size should I play with? The default seems to be 8192 KB.

We use something like ulimit -s 512 to set a 512K stack size at the OS level.
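
(As a minimal sketch, assuming the fetcher is launched from the same shell session; the segment path is just the example used elsewhere in this thread:)

---
# shrink the per-thread stack from the 8192 KB default before launching the fetcher
ulimit -s 512
bin/nutch fetch crawl/segments/20060702232437
---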

>Also, are there any other parameters I should tweak?

We specify -Xss512K when running the fetch map-reduce task to set the 
stack size in the JVM. But I don't remember off the top of my head 
which of the many different config files this gets set in. Stefan?
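
(For what it's worth, in a 0.8-era Hadoop setup the child JVM options are typically set via mapred.child.java.opts; treat the property name, file, and heap value below as assumptions rather than a definitive answer:)

---
<!-- hadoop-site.xml sketch; assumes mapred.child.java.opts controls the
     options passed to the child JVMs running the fetch map tasks -->
<property>
  <name>mapred.child.java.opts</name>
  <!-- -Xss512k is the stack-size setting discussed here; the heap value
       is only illustrative -->
  <value>-Xmx200m -Xss512k</value>
</property>
---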
>
>I often get a "too many open
>files" problem

That's a separate issue.

>and I never could use my full bandwidth; I am using
>about 10% of it. I have played around with ulimit -n "very
>high number", which solves the "too many open files" error, but it's
>still not utilizing all my bandwidth. Any help will be very much appreciated.

Try increasing the number of fetcher threads and reducing the stack 
size. With 10 high-end servers in a cluster, we were able to max out 
a 100 Mbps connection for brief periods, though as our crawl converged 
(because it's a vertical crawl) the max rate eventually drops to 
about 50 Mbps.

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

Re: .8 svn - fetcher performance..

Posted by Zaheed Haque <za...@gmail.com>.
On 6/28/06, Ken Krugler <kk...@transpac.com> wrote:
> Hi Doug,
>
> >Did you ever resolve your 0.8 vs 0.7 crawling performance question? I'm
> >running into a similar problem.
>
> We wound up dramatically increasing the number of threads, which
> seemed to help solve the bandwidth utilization problem. With Nutch
> 0.7 we were running about 200 threads per crawler, and with Nutch 0.8
> it's more like 2000+ threads...though you have to reduce the thread
> stack size in this type of configuration.

Hi Ken

Could you please give me some clue regarding the stack size at which
you are seeing the best bandwidth utilization... I have the following

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
max nice                        (-e) 20
file size               (blocks, -f) unlimited
pending signals                 (-i) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) unlimited
max rt priority                 (-r) unlimited
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

What stack size should I play with? The default seems to be 8192 KB.
Also, are there any other parameters I should tweak? I often get a
"too many open files" problem, and I never could use my full bandwidth;
I am using about 10% of it. I have played around with ulimit -n "very
high number", which solves the "too many open files" error, but it's
still not utilizing all my bandwidth. Any help will be very much appreciated.

Thanks
Zaheed


> -- Ken
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "Find Code, Find Answers"
>

Re: deleting URL duplicates - never actually deleted?

Posted by Marko Bauhardt <mb...@media-style.com>.
>
> # De-duplicate indexes
> # "bogus" argument is ignored but needed due to
> # a bug in the number of args expected
> bin/nutch dedup crawl/segments bogus
>

The dedup command works only on indexes, not on one or more
segments. The directory structure of an index looks like:
index/part-00000/SOME_LUCENE_FILES

Here is an example of the structure of a crawl:
crawl/segments/20060702232437
crawl/segments/20060702233040
crawl/linkdb
crawl/indexes //this is the index of the two segments

Now you can run dedup: bin/nutch dedup crawl/indexes

If you run dedup on a folder which contains segments, an exception
should be thrown. Look at your logfiles and verify that the dedup
process runs without exceptions.
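
(Putting that together as a sketch, using the example layout above:)

---
# verify the layout first: crawl/indexes should contain part-00000, part-00001, ...
ls crawl/indexes

# de-duplicate the indexes (not the segments)
bin/nutch dedup crawl/indexes
---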

Marko


Re: deleting URL duplicates - never actually deleted?

Posted by Honda-Search Administrator <ad...@honda-search.com>.
Marko,

Currently the shell command is as follows:

---
# index new segment
bin/nutch index $s1

# update the database
bin/nutch updatedb crawl/db $s1

# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup crawl/segments bogus

# Merge indexes
ls -d crawl/segments/* | xargs bin/nutch merge crawl/index
---

Should I actually switch the last two commands around?

Matt 

----- Original Message -----
From: "Marko Bauhardt" <mb...@media-style.com>
To: <nu...@lucene.apache.org>
Sent: Friday, June 30, 2006 2:57 AM
Subject: Re: deleting URL duplicates - never actually deleted?


> 
>  Do you delete the duplicates before you merge the index? Run the
> merge command first and then the dedup command.
> 
> But a better way is to create one index of all segments with the
> index command and then run the dedup command on this one index.
> 
> Hope this Helps,
> Marko
> 
> 
> Am 29.06.2006 um 23:07 schrieb Honda-Search Administrator:
> 
>> Maybe someone can explain to me how this works.
>>
>> First, my setup.
>>
>> I create a fetchlist each night with FreeFetchlistTool and fetch  
>> those pages.  It often contains the same URLs that are already in  
>> the database, but this tool gets the newest copies of those URLs.
>>
>> I also run nutch dedup after everything is fetched, indexed, etc.   
>> I then merge the segments using the following command:
>>
>> ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
>>
>> Every night the number of "duplicates" increases.  This is so  
>> because the duplicates from the day before are not actually deleted  
>> (I assume).
>>
>> Is dedup removing them from some sort of master index and the  
>> segments retain their original information?
>>
>> If so, is there a way to merge the segments into one (or whatever)  
>> so that duplicate URLs do not exist?  Would mergesegs do this?
>>
>> Thanks for any help, and I hope my question is clear.
>>
>> Matt
>>
>>
> 
> 
>

Re: deleting URL duplicates - never actually deleted?

Posted by Marko Bauhardt <mb...@media-style.com>.
Do you delete the duplicates before you merge the index? Run the
merge command first and then the dedup command.

But a better way is to create one index of all segments with the
index command and then run the dedup command on this one index.
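
(As a sketch, applying that advice to the script quoted above: run the merge step before dedup, and point dedup at the resulting index rather than at the segments. The command arguments are taken from the script in this thread and may differ between Nutch releases, so treat this as illustrative of the ordering only:)

---
# index new segment
bin/nutch index $s1

# update the database
bin/nutch updatedb crawl/db $s1

# Merge indexes first...
ls -d crawl/segments/* | xargs bin/nutch merge crawl/index

# ...then de-duplicate the merged index
bin/nutch dedup crawl/index
---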

Hope this Helps,
Marko


Am 29.06.2006 um 23:07 schrieb Honda-Search Administrator:

> Maybe someone can explain to me how this works.
>
> First, my setup.
>
> I create a fetchlist each night with FreeFetchlistTool and fetch  
> those pages.  It often contains the same URLs that are already in  
> the database, but this tool gets the newest copies of those URLs.
>
> I also run nutch dedup after everything is fetched, indexed, etc.   
> I then merge the segments using the following command:
>
> ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
>
> Every night the number of "duplicates" increases.  This is so  
> because the duplicates from the day before are not actually deleted  
> (I assume).
>
> Is dedup removing them from some sort of master index and the  
> segments retain their original information?
>
> If so, is there a way to merge the segments into one (or whatever)  
> so that duplicate URLs do not exist?  Would mergesegs do this?
>
> Thanks for any help, and I hope my question is clear.
>
> Matt
>
>


deleting URL duplicates - never actually deleted?

Posted by Honda-Search Administrator <ad...@honda-search.com>.
Maybe someone can explain to me how this works.

First, my setup.

I create a fetchlist each night with FreeFetchlistTool and fetch those 
pages.  It often contains the same URLs that are already in the database, 
but this tool gets the newest copies of those URLs.

I also run nutch dedup after everything is fetched, indexed, etc.  I then 
merge the segments using the following command:

ls -d $segments_dir/* | xargs bin/nutch merge $index_dir

Every night the number of "duplicates" increases.  This is so because the 
duplicates from the day before are not actually deleted (I assume).

Is dedup removing them from some sort of master index and the segments 
retain their original information?

If so, is there a way to merge the segments into one (or whatever) so that 
duplicate URLs do not exist?  Would mergesegs do this?

Thanks for any help, and I hope my question is clear.

Matt


Re: .8 svn - fetcher performance..

Posted by TDLN <di...@gmail.com>.
> OK, this isn't true, because it is sorted using HashComparator. For some
> reason the generated list contains some parts which are more or less
> sorted by host and some parts which look more "random".

This is consistent with what I am seeing; the Fetcher slowing down for
a while, sometimes coming to a virtual halt (a lot of repeated fetch
speed statements, for instance "412052 pages, 95915 errors, 3.0
pages/s, 638 kb/s,"), then the Fetcher speeding up again and fetching
at acceptable fetch speeds.

Rgrds, Thomas

Re: .8 svn - fetcher performance..

Posted by Sami Siren <ss...@gmail.com>.
>>
>> Fetchlist seems to be sorted by URL. This leads to many threads being
>
OK, this isn't true, because it is sorted using HashComparator. For some
reason the generated list contains some parts which are more or less
sorted by host and some parts which look more "random".

--
 Sami Siren


Re: .8 svn - fetcher performance..

Posted by TDLN <di...@gmail.com>.
+1 for a solution to this pressing issue!

I am seeing the same problem, in my case two symptoms:

1) low fetch speeds
2) crawls end "before their time" with an "aborting with xxx hung
threads" error message

I am doing a focussed crawl on about 70,000 domains.
crawl.ignore.external.links is set to true.

In previous discussions on the list these issues have mainly been
attributed to crawls on such a limited set of domains.

See if I understand this correctly. FetchLists are hostwise disjoint,
thus all URLs from the same domain are in the same FetchList. Folks
*not* on MapReduce are by definition always working with one Fetcher.
Otherwise there could be many, in which case this mechanism prevents
the politeness rules from being disobeyed.

Could somebody confirm these assumptions are correct?

I have tried to work around the issues by changing the configuration.
I tried increasing fetcher.threads.fetch, http.timeout and
http.max.delays.

I also changed generate.max.per.host setting, following Doug's advice
of setting this value to TopN / Fetcher Threads, all to no lasting
avail.

So far, I haven't tried increasing the fetcher.threads.per.host to
more than 4 with 100 threads, though. I will do that now.
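
(For reference, these knobs all live in nutch-site.xml; a minimal sketch with purely illustrative values, not recommendations:)

---
<!-- nutch-site.xml sketch; values are illustrative only -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>100</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>4</value>
</property>
<property>
  <name>generate.max.per.host</name>
  <!-- e.g. topN divided by the number of fetcher threads, per Doug's advice -->
  <value>1000</value>
</property>
<property>
  <name>http.timeout</name>
  <!-- milliseconds -->
  <value>10000</value>
</property>
<property>
  <name>http.max.delays</name>
  <value>100</value>
</property>
---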

I really think we should gather some more data regarding fetch speed
problems. Maybe some of you who are seeing decent fetch speeds in a
focussed crawl setup could share some of your tips on tuning the
installation.

Thanks a lot for your time if you read this far :)

Rgrds, Thomas Delnoij




On 6/28/06, Sami Siren <ss...@gmail.com> wrote:
> Ken Krugler wrote:
>
> > Hi Doug,
> >
> >> Did you ever resolve your 0.8 vs 0.7 crawling performance question? I'm
> >> running into a similar problem.
> >
> >
> > We wound up dramatically increasing the number of threads, which
> > seemed to help solve the bandwidth utilization problem. With Nutch 0.7
> > we were running about 200 threads per crawler, and with Nutch 0.8 it's
> > more like 2000+ threads...though you have to reduce the thread stack
> > size in this type of configuration.
> >
> Fetchlist seems to be sorted by URL. This leads to many threads being
> blocked when the crawler is configured to fetch with a low number of
> threads per host (default 1) and there are several URLs from the same
> host in the fetchlist.
>
> This could perhaps be improved by sorting by some other key?
>
> --
>  Sami Siren
>
>
>
>

Re: .8 svn - fetcher performance..

Posted by Sami Siren <ss...@gmail.com>.
Ken Krugler wrote:

> Hi Doug,
>
>> Did you ever resolve your 0.8 vs 0.7 crawling performance question? I'm
>> running into a similar problem.
>
>
> We wound up dramatically increasing the number of threads, which 
> seemed to help solve the bandwidth utilization problem. With Nutch 0.7 
> we were running about 200 threads per crawler, and with Nutch 0.8 it's 
> more like 2000+ threads...though you have to reduce the thread stack 
> size in this type of configuration.
>
Fetchlist seems to be sorted by URL. This leads to many threads being
blocked when the crawler is configured to fetch with a low number of
threads per host (default 1) and there are several URLs from the same
host in the fetchlist.

This could perhaps be improved by sorting by some other key?

--
 Sami Siren




Re: .8 svn - fetcher performance..

Posted by Ken Krugler <kk...@transpac.com>.
Hi Doug,

>Did you ever resolve your 0.8 vs 0.7 crawling performance question? I'm
>running into a similar problem.

We wound up dramatically increasing the number of threads, which 
seemed to help solve the bandwidth utilization problem. With Nutch 
0.7 we were running about 200 threads per crawler, and with Nutch 0.8 
it's more like 2000+ threads...though you have to reduce the thread 
stack size in this type of configuration.

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

Re: .8 svn - fetcher performance..

Posted by Doug Cook <na...@candiru.com>.
Byron, 

Did you ever resolve your 0.8 vs 0.7 crawling performance question? I'm
running into a similar problem.
-- 
View this message in context: http://www.nabble.com/.8-svn----fetcher-performance..-tf1170232.html#a5076764
Sent from the Nutch - User forum at Nabble.com.