Posted to dev@nutch.apache.org by og...@yahoo.com on 2008/04/21 22:40:04 UTC

Re: Fetching inefficiency

Adding some comments to the email below, but here on nutch-dev.

Basically, it is my feeling that whenever fetchlists (and their parts) are not "well balanced", this inefficiency will be seen.
Concretely, whichever task gets "stuck fetching from the slow server with a lot of its pages in the fetchlist" will prolong the whole fetch job.  A slow server with lots of pages is a bad combination, and I see that a lot.  Perhaps it's the nature of my crawl - it is constrained, not web-wide, with the number of distinct hosts around 15-20K.

* Example fetchlist part:
slow.com/1
fast.com/1
ok.com/1
slow.com/2
fast.com/2
ok.com/2
ok.com/3
slow.com/3
slow.com/4
slow.com/5
slow.com/6

* The above fetchlist part will take a lot longer than this one:
speedy.com/1
speedy.com/2
speedy.com/3
speedy.com/4
superspeedy.com/1
ok2.com/1
ok2.com/2
speedy.com/5
speedy.com/6
speedy.com/7
ok2.com/3
speedy.com/8

The task processing the first set of URLs will be slower because it got stuck with the slow slow.com server, which happens to have a lot of pages in that fetchlist part.  The task processing the second set of URLs will finish quickly, since all of its servers are fairly fast.
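
A back-of-envelope model makes the gap concrete.  Within one task, each host's queue is fetched serially (politeness), while different hosts proceed in parallel across threads, so a task cannot finish before its slowest host queue does.  The per-page timings below are invented purely for illustration:

public class FetchTimeEstimate {
  // pages * seconds-per-page (response time + politeness delay)
  static long hostTime(int pages, double secsPerPage) {
    return Math.round(pages * secsPerPage);
  }

  public static void main(String[] args) {
    // part 1: slow.com at ~30s/page dominates everything else
    long part1 = Math.max(hostTime(6, 30.0),   // slow.com, 6 URLs
                 Math.max(hostTime(2, 1.0),    // fast.com, 2 URLs
                          hostTime(3, 5.0)));  // ok.com, 3 URLs
    // part 2: every host is reasonably fast
    long part2 = Math.max(hostTime(8, 2.0),    // speedy.com, 8 URLs
                 Math.max(hostTime(1, 1.0),    // superspeedy.com, 1 URL
                          hostTime(3, 5.0)));  // ok2.com, 3 URLs
    System.out.println("part 1 >= " + part1 + "s, part 2 >= " + part2 + "s");
    // prints: part 1 >= 180s, part 2 >= 16s
  }
}

Even though both parts hold about a dozen URLs, the first is bounded below by slow.com's serialized queue.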

Some questions:
Are there ways around this?
Are others not seeing the same behaviour?
Is this just the nature of my crawl - constrained and with only 15-20K unique servers?

If others are seeing this behaviour, then I'm wondering whether anyone has thoughts on improving this, either before or after the 1.0 release.  For instance, maybe things would be better with that HostDb and a Generator that knows not to produce fetchlists with lots of URLs from slow servers - something like the sketch below.  Or maybe there is a way to keep feeding Fetchers with URLs from other sites, so their idle threads can be kept busy instead of sitting in spinWait.
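
A minimal sketch of that HostDb idea, assuming a HostDb that records average fetch time per host from earlier rounds (the HostDb interface and getAvgFetchSecs() are hypothetical - none of this exists in Nutch today):

// Hypothetical: cap per-host URL counts at generate time using observed
// host speed, so no single slow host can dominate a fetchlist part.
public class SpeedAwareGenerator {
  // the most wall-clock time one host may consume within a single task
  static final long HOST_TIME_BUDGET_SECS = 3600;

  static int maxUrlsForHost(String host, HostDb hostDb) {
    // average seconds per fetch observed in earlier rounds (assumed field)
    double secsPerFetch = Math.max(0.001, hostDb.getAvgFetchSecs(host));
    return (int) Math.max(1, HOST_TIME_BUDGET_SECS / secsPerFetch);
  }
}

// stand-in for the proposed HostDb
interface HostDb {
  double getAvgFetchSecs(String host);
}

With a one-hour budget, a host averaging 30s/page would be capped at 120 URLs per fetchlist part, while a 1s/page host could contribute up to 3600.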

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: "ogjunk-nutch@yahoo.com" <og...@yahoo.com>
> To: Nutch User List <nu...@lucene.apache.org>
> Sent: Monday, April 21, 2008 4:16:24 PM
> Subject: Fetching inefficiency
> 
> Hello,
> 
> I am wondering how others deal with the following, which I see as fetching 
> inefficiency:
> 
> 
> When fetching, the fetchlist is broken up into multiple parts and fetchers on 
> cluster nodes start fetching.  Some fetchers end up fetching from fast servers, 
> and some from very very slow servers.  Those fetching from slow servers take a 
> long time to complete and prolong the whole fetching process.  For instance, 
> I've seen tasks from the same fetch job finish in only 1-2 hours, and others in 
> 10 hours.  Those taking 10 hours were stuck fetching pages from a single or 
> handful of slow sites.  If you have two nodes doing the fetching and one is 
> stuck with a slow server, the other one is idling and wasting time.  The node 
> stuck with the slow server is also underutilized, as it's slowly fetching from 
> only 1 server instead of many.
> 
> I imagine anyone using Nutch is seeing the same.  If not, what's the trick?
> 
> I have not tried overlapping fetching jobs yet, but I have a feeling that won't 
> help a ton, plus it could lead to two fetchers fetching from the same server and 
> being impolite - am I wrong?
> 
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


Re: Fetching inefficiency

Posted by Ken Krugler <kk...@transpac.com>.
>Adding some comments to the email below, but here on nutch-dev.
>Basically, it is my feeling that whenever fetchlists (and their
>parts) are not "well balanced", this inefficiency will be seen.
>Concretely, whichever task gets "stuck fetching from the slow server
>with a lot of its pages in the fetchlist" will prolong the whole
>fetch job.  A slow server with lots of pages is a bad combination,
>and I see that a lot.  Perhaps it's the nature of my crawl - it is
>constrained, not web-wide, with the number of distinct hosts around
>15-20K.

[snip]

>Some questions: Are there ways around this? Are others not seeing 
>the same behaviour? Is this just the nature of my crawl - 
>constrained and with only 15-20K unique servers?

We often ran into the same problem while doing our vertical tech-pages 
crawl - a smaller number of unique hosts, and some really slow hosts 
slowing down the entire fetch cycle.

We added code that terminated slow fetches. After fooling around with 
some different approaches, I think we settled on terminating all 
remaining fetches when the number of active fetch threads dropped 
below a threshold derived from the total number of threads available. 
The ratio was set to 20% or so.

URLs that were terminated in this manner would get their status set 
to the same as if the page had returned a "temp unavailable" HTTP 
response, IIRC.
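
In rough Java, the watchdog amounted to something like this (a from-memory sketch rather than our actual code; the counter type, poll interval, and class names are illustrative):

public class FetchWatchdog extends Thread {
  private final int totalThreads;
  private final java.util.concurrent.atomic.AtomicInteger activeThreads;
  private volatile boolean aborted = false;

  FetchWatchdog(int totalThreads,
                java.util.concurrent.atomic.AtomicInteger activeThreads) {
    this.totalThreads = totalThreads;
    this.activeThreads = activeThreads;
    setDaemon(true);
  }

  public void run() {
    try {
      while (!aborted) {
        Thread.sleep(10000L);
        // once fewer than ~20% of threads are still busy, the remaining
        // work is all slow hosts - give up on it
        if (activeThreads.get() < totalThreads * 0.2) {
          aborted = true;
        }
      }
    } catch (InterruptedException e) {
      // fetcher is shutting down normally
    }
  }

  // fetch threads poll this; when true they stop, and leftover URLs get
  // the same status as a "temp unavailable" HTTP response
  public boolean isAborted() {
    return aborted;
  }
}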

This worked pretty well, though we had to hack the httpclient lib 
because even when you interrupted a fetch, there was some cleanup 
code executed during a socket close that would try to empty the 
stream, and for some slow servers this would still cause the fetch to 
hang.
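
For reference, with commons-httpclient 3.x (the library Nutch's protocol-httpclient builds on) the distinction looks roughly like this, if I recall the library correctly: releaseConnection() tries to drain the remaining response body so the connection can be reused, which is the cleanup that hangs on slow servers, while abort() hard-closes the socket.  The URL below is just illustrative:

import java.io.IOException;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

public class AbortExample {
  public static void main(String[] args) throws IOException {
    HttpClient client = new HttpClient();
    GetMethod get = new GetMethod("http://slow.example.com/page");
    try {
      client.executeMethod(get);
      // ... read the body against a deadline; when the deadline passes:
      get.abort();             // hard-closes the socket - no stream draining
    } finally {
      get.releaseConnection(); // safe after abort(): nothing left to drain
    }
  }
}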

-- Ken




-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"