Posted to user@nutch.apache.org by Ken Krugler <kk...@transpac.com> on 2009/08/26 16:34:22 UTC

Re: Is Nutch purposely slowing down the crawl, or is it just really really inefficient?

Hi Paul,

Yes, Nutch does in fact slow down the crawl to be considered "polite".

wget doesn't.

If the sites you are crawling are under your control, or you have an  
understanding with the site ops people, then you can alter Nutch's  
default settings to make it run at near full speed.

-- Ken


On Aug 26, 2009, at 6:55am, Paul Tomblin wrote:

> I'm trying to crawl three tiny little sites with Nutch, and it takes
> 45 minutes.  To copy the same files to my local hard drive using wget
> takes between 35 seconds and a minute.  What is Nutch doing that
> causes it to take 45 times as long?
>
> -- 
> http://www.linkedin.com/in/paultomblin

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378


RE: Is Nutch purposely slowing down the crawl, or is it just really really inefficient?

Posted by Fuad Efendi <fu...@efendi.ca>.
Only if the website provides "Last-Modified:" and "ETag:" response headers on
the initial retrieval, and only if it understands Nutch's "If-Modified-Since"
request header... However, even in this case Nutch must be polite and not
make frequent requests against the same site, even with "If-Modified-Since"
request headers: each HTTP request is logged, a fresh response might be sent
instead of a 304, and every request (even a 304) still consumes server-side
resources - a TCP/IP connection, a client thread, CPU time, etc.
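The conditional-fetch mechanics described above can be sketched as follows.
This is an illustrative standalone example (the class and method names are
made up for this sketch, not Nutch's actual fetcher API): the crawler records
when it last fetched a page, sends that timestamp as an "If-Modified-Since"
header value, and skips re-downloading when the server answers 304.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class ConditionalFetch {
    // HTTP date headers use the RFC 1123 format, always in GMT.
    static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.RFC_1123_DATE_TIME.withZone(ZoneOffset.UTC);

    // Build the If-Modified-Since header value from the last fetch time.
    static String ifModifiedSince(long lastFetchEpochMillis) {
        return HTTP_DATE.format(Instant.ofEpochMilli(lastFetchEpochMillis));
    }

    // A 304 Not Modified response means the cached copy is still current;
    // anything else (typically 200) means the page must be re-downloaded.
    static boolean needsRefetch(int responseCode) {
        return responseCode != 304;
    }

    public static void main(String[] args) {
        long lastFetch = 1251312862000L; // 2009-08-26T18:54:22Z
        System.out.println("If-Modified-Since: " + ifModifiedSince(lastFetch));
        System.out.println(needsRefetch(304)); // false - keep cached copy
        System.out.println(needsRefetch(200)); // true  - page changed
    }
}
```

Even with this in place, each conditional request still costs the server a
connection and a thread, which is why the politeness delay matters regardless.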


You can have several threads concurrently accessing the same website with a
crawl delay of 0 only if that website is fully under your control, or you
have its owner's permission.


-Fuad
http://www.linkedin.com/in/liferay
http://www.tokenizer.org



-----Original Message-----
From: ptomblin@gmail.com [mailto:ptomblin@gmail.com] On Behalf Of Paul
Tomblin
Sent: August-26-09 1:36 PM
To: nutch-user@lucene.apache.org
Subject: Re: Is Nutch purposely slowing down the crawl, or is it just really
really inefficient?

On Wed, Aug 26, 2009 at 1:32 PM, MilleBii<mi...@gmail.com> wrote:
> beware you could create a kind of Denial of Service attack if you search
the
> site too quickly.
>

Well, since I fixed Nutch so that it understands "Last-Modified" and
"If-Modified-Since", it won't be downloading the pages 99% of the
time, just getting a "304" response.

-- 
http://www.linkedin.com/in/paultomblin



Re: Is Nutch purposely slowing down the crawl, or is it just really really inefficient?

Posted by Paul Tomblin <pt...@xcski.com>.
On Wed, Aug 26, 2009 at 1:32 PM, MilleBii<mi...@gmail.com> wrote:
> beware you could create a kind of Denial of Service attack if you search the
> site too quickly.
>

Well, since I fixed Nutch so that it understands "Last-Modified" and
"If-Modified-Since", it won't be downloading the pages 99% of the
time, just getting a "304" response.

-- 
http://www.linkedin.com/in/paultomblin

Re: Is Nutch purposely slowing down the crawl, or is it just really really inefficient?

Posted by MilleBii <mi...@gmail.com>.
beware you could create a kind of Denial of Service attack if you search the
site too quickly.

2009/8/26 Paul Tomblin <pt...@xcski.com>

> On Wed, Aug 26, 2009 at 10:55 AM, Kirby Bohling<ki...@gmail.com>
> wrote:
> > Paul,
> >
> >   I'd read the nutch-default.xml file, I believe the properties you'd
> > like to examine start in the section labelled <!-- fetcher properties
> > -->
> >
> > fetcher.threads.per.host
> > fetcher.server.delay
> > fetcher.server.min.delay
> > fetcher.max.crawl.delay
> >
>
> Ok, that is interesting.  Since I'm only crawling one site at a time,
> I guess there is no point having more threads than
> "fetcher.threads.per.host" - or more to the point, I should increase
> fetcher.threads.per.host up to the number of threads.
>
>
> --
> http://www.linkedin.com/in/paultomblin
>



-- 
-MilleBii-

Re: Is Nutch purposely slowing down the crawl, or is it just really really inefficient?

Posted by Paul Tomblin <pt...@xcski.com>.
On Wed, Aug 26, 2009 at 10:55 AM, Kirby Bohling<ki...@gmail.com> wrote:
> Paul,
>
>   I'd read the nutch-default.xml file, I believe the properties you'd
> like to examine start in the section labelled <!-- fetcher properties
> -->
>
> fetcher.threads.per.host
> fetcher.server.delay
> fetcher.server.min.delay
> fetcher.max.crawl.delay
>

Ok, that is interesting.  Since I'm only crawling one site at a time,
I guess there is no point having more threads than
"fetcher.threads.per.host" - or more to the point, I should increase
fetcher.threads.per.host up to the number of threads.


-- 
http://www.linkedin.com/in/paultomblin

Re: Is Nutch purposely slowing down the crawl, or is it just really really inefficient?

Posted by Kirby Bohling <ki...@gmail.com>.
On Wed, Aug 26, 2009 at 9:43 AM, Paul Tomblin<pt...@xcski.com> wrote:
> On Wed, Aug 26, 2009 at 10:34 AM, Ken
> Krugler<kk...@transpac.com> wrote:
>> If the sites you are crawling are under your control, or you have an
>> understanding with the site ops people, then you can alter Nutch's default
>> settings to make it run at near full speed.
>
> What settings would those be?  I tried increasing the number of
> threads from 10 to 125, but it had absolutely no discernible effect on
> the crawl speed.
>

Paul,

   I'd read the nutch-default.xml file; I believe the properties you'd
like to examine start in the section labelled <!-- fetcher properties
-->

fetcher.threads.per.host
fetcher.server.delay
fetcher.server.min.delay
fetcher.max.crawl.delay


I'm guessing there are others, but those 4 looked like they were most
closely related.  Spending a bit of time reading the descriptions in
conf/nutch-default.xml is very helpful for tracking these things down.
Override those values in conf/nutch-site.xml; don't directly change
nutch-default.xml (at least that's what everything I've read
recommends).
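To make the advice concrete, an override in conf/nutch-site.xml might look
like the sketch below. The values are purely illustrative - a zero delay and
many threads per host are only defensible on a site you control, per the
discussion above - and you should check the property descriptions in your
version's nutch-default.xml before copying them:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fetcher.server.delay</name>
    <value>0.0</value>
    <!-- Illustrative: no politeness delay; safe only on your own site. -->
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>10</value>
    <!-- Illustrative: allow several concurrent fetches against one host. -->
  </property>
</configuration>
```

Settings placed here take precedence over nutch-default.xml, which is why
Kirby recommends editing this file rather than the defaults.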

Thanks,
    Kirby


> --
> http://www.linkedin.com/in/paultomblin
>

Re: Is Nutch purposely slowing down the crawl, or is it just really really inefficient?

Posted by Paul Tomblin <pt...@xcski.com>.
On Wed, Aug 26, 2009 at 10:34 AM, Ken
Krugler<kk...@transpac.com> wrote:
> If the sites you are crawling are under your control, or you have an
> understanding with the site ops people, then you can alter Nutch's default
> settings to make it run at near full speed.

What settings would those be?  I tried increasing the number of
threads from 10 to 125, but it had absolutely no discernible effect on
the crawl speed.

-- 
http://www.linkedin.com/in/paultomblin