Posted to dev@nutch.apache.org by Peter Veentjer - Anchor Men <p....@anchormen.nl> on 2005/09/05 12:01:18 UTC

fetcher question: why multithreaded?

Hi,
 
I'm looking at the fetcher code and have the following question:
why does the fetcher do more than fetching? Wouldn't it be better to
move the page parsing to another component and let the fetcher only
fetch (so that the fetch threads only do fetching)?
 
Another problem with this threaded approach is that you need a lot
of threads, because a single thread is responsible both for retrieving
data and for parsing it. If you remove the parsing part, a thread would
only be responsible for fetching. That makes it possible to use
a single thread in the Fetcher that gathers data from many sockets
(which reduces context-switching overhead). This technique is
widely used in search engines, and I'm curious why Nutch
takes a different approach.
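
The single-thread idea described above can be sketched with Java NIO, where one
Selector watches many non-blocking sockets. This is only an illustration under
stated assumptions, not Nutch's fetcher: the class SelectorFetchSketch, the
fetchAll method and the Conn helper are hypothetical names, the request is a
bare HTTP/1.0 GET for "/", and timeouts, redirects, politeness delays and
robots.txt handling are all left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Nutch code: one thread, one Selector, many sockets.
class SelectorFetchSketch {

    // Per-connection state, attached to the channel's SelectionKey.
    private static final class Conn {
        final InetSocketAddress addr;
        final ByteBuffer request;
        final ByteArrayOutputStream response = new ByteArrayOutputStream();

        Conn(InetSocketAddress addr) {
            this.addr = addr;
            String req = "GET / HTTP/1.0\r\nHost: " + addr.getHostString() + "\r\n\r\n";
            this.request = ByteBuffer.wrap(req.getBytes(StandardCharsets.US_ASCII));
        }
    }

    // Fetches "/" from every target on the calling thread and returns the raw
    // bytes received (headers included), keyed by address. No parsing happens here.
    static Map<InetSocketAddress, byte[]> fetchAll(List<InetSocketAddress> targets)
            throws IOException {
        Map<InetSocketAddress, byte[]> results = new HashMap<>();
        ByteBuffer buf = ByteBuffer.allocate(8192);
        try (Selector selector = Selector.open()) {
            for (InetSocketAddress addr : targets) {
                SocketChannel ch = SocketChannel.open();
                ch.configureBlocking(false);
                // A local connect may complete immediately; then skip OP_CONNECT.
                boolean connected = ch.connect(addr);
                ch.register(selector,
                        connected ? SelectionKey.OP_WRITE : SelectionKey.OP_CONNECT,
                        new Conn(addr));
            }
            int open = targets.size();
            while (open > 0) {
                selector.select(); // blocks until at least one socket is ready
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (handle(key, buf, results)) {
                        open--;
                    }
                }
            }
        }
        return results;
    }

    // Advances one connection; returns true once that connection is finished.
    private static boolean handle(SelectionKey key, ByteBuffer buf,
                                  Map<InetSocketAddress, byte[]> results) throws IOException {
        SocketChannel ch = (SocketChannel) key.channel();
        Conn conn = (Conn) key.attachment();
        boolean done = false;
        try {
            if (key.isConnectable()) {
                if (ch.finishConnect()) key.interestOps(SelectionKey.OP_WRITE);
            } else if (key.isWritable()) {
                ch.write(conn.request);
                if (!conn.request.hasRemaining()) key.interestOps(SelectionKey.OP_READ);
            } else if (key.isReadable()) {
                buf.clear();
                int n = ch.read(buf);
                if (n < 0) done = true;               // server closed: response complete
                else conn.response.write(buf.array(), 0, n);
            }
        } catch (IOException e) {
            done = true; // connect or I/O failure: keep whatever arrived so far
        }
        if (done) {
            results.put(conn.addr, conn.response.toByteArray());
            ch.close(); // also cancels the key
        }
        return done;
    }
}
```

Calling SelectorFetchSketch.fetchAll(targets) with a list of resolved addresses
returns once every connection has been closed by its server or has failed. The
raw bytes could then be handed to a separate parsing component, which is the
split proposed above. Because select() is called without a timeout here, a
server that never closes its connection would stall the loop; a real fetcher
would need per-connection deadlines.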
 
 
 
 

Kind regards,

Peter Veentjer
Anchor Men Interactive Solutions - clear in business
internet solutions

Praediniussingel 41
9711 AE Groningen

T: 050-3115222
F: 050-5891696
E: p.veentjer@anchormen.nl
I: www.anchormen.nl

 

RE: fetcher question: why multithreaded?

Posted by EM <em...@cpuedge.com>.
I'm currently fetching with 35 threads. The CPU load is about 5-10% (P4 3.0
HT), so parsing clearly isn't using many resources.

Removing parsing would also not speed up the fetching process. If parsing
(while fetching) were removed (with a command-line argument), I'd probably
tune the fetcher down to 30 threads and get the same overall fetching
speed.
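
The 35-versus-30 estimate above can be turned into a rough figure for how much
of each thread's time goes to parsing. The sketch below assumes a deliberately
simple model that is not stated in the thread: every thread handles one page at
a time, spends a fetch time f and then a parse time p on it, and throughput
scales linearly with the thread count. FetchThreadModel and its methods are
hypothetical names used only for this illustration.

```java
// Hypothetical back-of-the-envelope model, not a measurement of Nutch.
class FetchThreadModel {

    // Pages per second when each of `threads` threads spends
    // fetchSec + parseSec of wall-clock time on every page.
    static double pagesPerSecond(int threads, double fetchSec, double parseSec) {
        return threads / (fetchSec + parseSec);
    }

    // Ratio p / f implied by "N1 threads with parsing are as fast as
    // N2 threads without parsing":
    //   N1 / (f + p) = N2 / f   =>   p / f = N1 / N2 - 1
    static double impliedParseToFetchRatio(int threadsWithParsing, int threadsWithoutParsing) {
        return (double) threadsWithParsing / threadsWithoutParsing - 1.0;
    }

    public static void main(String[] args) {
        double ratio = impliedParseToFetchRatio(35, 30);
        // Share of a thread's per-page time spent parsing: p / (f + p).
        double parseShare = ratio / (1.0 + ratio);
        System.out.printf("parse/fetch time ratio: %.3f%n", ratio);
        System.out.printf("share of per-page time spent parsing: %.3f%n", parseShare);
    }
}
```

Under these assumptions, parsing accounts for one seventh of each thread's
per-page wall-clock time, and the remainder is spent on the fetch itself. That
is consistent with the low CPU load reported above.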


