Posted to user@nutch.apache.org by Danicela nutch <Da...@mail.com> on 2011/10/04 15:39:32 UTC

Re: Re: Fetch performance

Hi,

I tried both -numFetchers 1 and -numFetchers 4, and both times I got 2 sequential fetches that each lasted 13 minutes.
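
For reference, the arithmetic behind these durations can be sketched in Python. This is only a back-of-the-envelope model, not something Nutch computes: the function name and its parameters are made up for illustration, and it assumes that the fullest host (130 URLs, a 5 s delay plus roughly 1 s of overhead per request) bounds the length of each fetch pass.

```python
# Rough fetch-time estimate. The names below are illustrative only;
# they are not Nutch properties or APIs.
def estimated_fetch_minutes(urls_per_host, delay_s, overhead_s=1.0, sequential_passes=1):
    # A polite fetcher hits one host at most once per (delay + overhead) seconds,
    # so the host with the most URLs bounds the duration of a pass.
    seconds_per_pass = urls_per_host * (delay_s + overhead_s)
    # Passes that run one after another add up instead of overlapping.
    return sequential_passes * seconds_per_pass / 60


print(estimated_fetch_minutes(130, 5))                       # single pass
print(estimated_fetch_minutes(130, 5, sequential_passes=2))  # two passes back to back
```

Under these assumptions a single pass comes out at 13 minutes and two back-to-back passes at 26 minutes, which matches what I observe.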

 Thanks.


----- Original Message -----
From: Julien Nioche
Sent: 28.09.11 17:16
To: user@nutch.apache.org
Subject: Re: Fetch performance

Hi,

Check the value of the parameter '-numFetchers' when calling generate. I guess you are using a value of 2 in non-distributed mode, i.e. the fetches are done in sequential order.

I'd strongly advise moving to a more recent version of Nutch if you can. A considerable number of improvements have been added since 1.0.

Julien

On 28 September 2011 15:50, Danicela nutch <Da...@mail.com> wrote:

> Hi,
>
> My config is:
>
> Nutch 1.0.
> generate.max.per.host = 130
> fetcher.server.delay = 5
> fetcher.threads.fetch = 50
> number of hosts in seeds = 30
>
> If the fetch were efficient, we would get 130 * 6 (5+1 imprecision) seconds
> = 13 min for a fetch.
>
> According to the results, a fetch lasts 26 minutes.
>
> When I analysed hadoop.log, I noticed that some sites are fetched during
> the first 13 minutes, and the other sites, which weren't fetched until the
> 13th minute, begin to be fetched after the 13th minute. These sites are
> fetched until the 26th minute.
>
> I can conclude that the fetch lasts twice as long as it should,
> because some of the sites are fetched only after the others. (Some STATS are
> produced between the 2 steps.)
>
> How can we prevent this split? I mean, how can we force all sites to be
> fetched from the beginning?
>
> Thanks in advance for helping.
>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com