Posted to user@nutch.apache.org by sdeck <sc...@gmail.com> on 2007/03/06 23:08:34 UTC

Crawl slow on one machine, fast on another

Hey all,
I'm looking for ideas on what to test for a weird issue. On my laptop at
home, I have a list of about 140K URLs to inject, and I get back roughly
12,000 URLs to retrieve/parse/index. This takes about an hour, sometimes
less, over a cable modem. When I push the exact same code to my hosted
server (dual processor, 2 GB of RAM; basically, a real server), it takes
roughly 5-6 hours to do the exact same thing.

Has anyone run into this? Any ideas on what I could check to find the
bottleneck?
If it is a DNS issue, how would I go about checking lookup times? (I have
seen this mentioned as a possible issue in the forums.)
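
One rough way to check lookup times from inside the JVM is to time the
standard java.net.InetAddress calls directly on both machines. This is only
a sketch, not anything Nutch itself ships: the class name DnsLookupTimer and
the default host list are made up for illustration, and the JVM caches
successful lookups, so only the first lookup of each host in a run is
meaningful.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsLookupTimer {

    /** Times a forward lookup (name to address); returns elapsed ms, or -1 if unresolved. */
    static long timeForwardLookupMillis(String host) {
        long start = System.nanoTime();
        try {
            InetAddress.getByName(host);
        } catch (UnknownHostException e) {
            return -1;
        }
        return (System.nanoTime() - start) / 1000000L;
    }

    /** Times a reverse lookup (address to name); returns elapsed ms, or -1 if unresolved. */
    static long timeReverseLookupMillis(String host) {
        try {
            // Resolve first, then rebuild from the raw bytes so no hostname is attached.
            byte[] raw = InetAddress.getByName(host).getAddress();
            InetAddress bare = InetAddress.getByAddress(raw);
            long start = System.nanoTime();
            bare.getCanonicalHostName(); // triggers the reverse lookup
            return (System.nanoTime() - start) / 1000000L;
        } catch (UnknownHostException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        // Hypothetical default hosts; pass a sample of hosts from your own URL list instead.
        String[] hosts = args.length > 0 ? args : new String[] {"nutch.apache.org", "www.nabble.com"};
        for (String host : hosts) {
            System.out.println(host
                    + " forward=" + timeForwardLookupMillis(host) + "ms"
                    + " reverse=" + timeReverseLookupMillis(host) + "ms");
        }
    }
}
```

Running the same host sample on the laptop and on the server should show
whether lookups on the server are in the multi-second range.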

Thanks,
Scott

-- 
View this message in context: http://www.nabble.com/Crawl-slow-on-one-machine%2C-fast-on-another-tf3358653.html#a9342187
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: [SOLVED] Crawl slow on one machine, fast on another

Posted by sdeck <sc...@gmail.com>.
Thanks for the reply.
It is my own server, hosted at a managed hosting site (ev1servers).
It is a Windows machine, and it also runs my website. Currently, I allocate
800 MB of RAM to the crawler and run 150 fetcher threads.
CPU usage is usually pretty low while it runs, peaking at maybe 20%
utilization.

The only reason I mention DNS is that, if the known issues with Java and DNS
are the culprit, many people have reported reverse lookups lasting 4-5
seconds. At 4-5 seconds each, over that many URLs, that could multiply the
run time by 4-5X. Not sure. Again, shooting in the dark; hopefully I will
hit a caribou.
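
To put rough bounds on that guess, here is a back-of-the-envelope sketch.
It assumes every one of the roughly 12,000 URLs pays one uncached lookup of
4.5 seconds, which is pessimistic since lookups are normally cached per
host, and that the work divides evenly across the fetcher threads. The
class and method names are made up for illustration.

```java
public class DnsOverheadEstimate {

    /** Extra seconds if every URL pays one lookup, spread evenly over the threads. */
    static double addedSeconds(int urls, double secondsPerLookup, int threads) {
        return urls * secondsPerLookup / threads;
    }

    public static void main(String[] args) {
        int urls = 12000;              // URLs fetched per run (from the first post)
        double secondsPerLookup = 4.5; // midpoint of the reported 4-5 seconds

        // Worst case: lookups happen one after another.
        double serial = addedSeconds(urls, secondsPerLookup, 1);
        // Best case: 150 fetcher threads overlap their lookups perfectly.
        double parallel = addedSeconds(urls, secondsPerLookup, 150);

        System.out.printf("fully serial: %.1f hours%n", serial / 3600.0);
        System.out.printf("150 threads:  %.1f minutes%n", parallel / 60.0);
    }
}
```

With these numbers the bounds come out to about 15 hours fully serial and
about 6 minutes with perfect overlap, so how much slow lookups actually
cost depends heavily on how well the fetcher threads overlap them.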

Oh, no throttling is done either.

As a guess, I am going to try using dnsjava in the app tonight and see if
that helps at all. Not sure. The frustrating part is just figuring out where
to start debugging. It really could be anything.

Scott

  

Sean Dean-3 wrote:
> 
> There could be many things causing the slowdown on the perceived faster
> server.
>  
> You should be aware that under normal operation Nutch only uses one
> processor. The process can jump around and execute on any of the
> processors you have, but never on more than one at a time.
>  
> You should watch the output of "top" or something similar to see what the
> slowdown might be. If you're running other things on the server, that might
> also be the cause. An increase of 4-5 hours makes it sound
> like it's more than just DNS lookups slowing you down.
>  
> Is your hosted server shared? Do they throttle your connection, processor
> time, or memory?
