Posted to user@nutch.apache.org by "Insurance Squared Inc." <gc...@insurancesquared.com> on 2006/02/20 17:10:10 UTC

Excessive retries

Hi,

We're finding that one or two domains are causing excessive retries, 
and that's drastically slowing our fetch process down by hours. 

Any general guidance on how to fix the problem?  We've upped our max 
retries variable to 3 from 1, I believe, but we're still getting the 
problem.  Here are some example URLs:

http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-D426F221/ama/web/travel_Group-Travel.htm
http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-CEF90BB0/ama/web/everything_auto_driver_ed.htm
http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-DEA4DDC2/ama/web/everything_auto_Vehicle-Safety.htm
http://www.plentyoffish.com/personals/3147onlinedating.htm
http://www.plentyoffish.com/personals/1032onlinedating27.htm
http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-CBE7B5E0/ama/web/insurance_Insurance-News.htm

Also, it seems like we're trying to access hundreds of thousands of pages from some of these domains. Shouldn't it be limiting the number of pages from a specific URL?  (I guess that's two questions.)
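
To make the second question concrete, here is a rough sketch of the kind of per-host limit I have in mind. It is plain Python, not Nutch code; cap_per_host and max_per_host are made-up names for illustration only, and I'm assuming the limit would be applied per hostname.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cap_per_host(urls, max_per_host):
    """Keep at most max_per_host URLs for each hostname, preserving input order.

    Hypothetical helper for illustration; not part of Nutch.
    """
    seen = defaultdict(int)  # hostname -> number of URLs kept so far
    kept = []
    for url in urls:
        host = urlsplit(url).hostname or ""
        if seen[host] < max_per_host:
            seen[host] += 1
            kept.append(url)
    return kept
```

With a cap of 1, a list holding two plentyoffish.com URLs and one ama.ab.ca URL would come back with one URL from each host.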

Offhand, it looks like some of these URLs carry a session variable (the SID-... segment in the ama.ab.ca ones).  My first guess is that it may be part of the problem.  These two domains make up something like 80-90% of our retries.  Clearly we need to stop the excessive retries and, at the same time, be a bit more polite with those domains.
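
To show what I mean by the session variable, here is a minimal sketch that strips it from a URL so that the same page fetched under different sessions collapses to one address. It is plain Python, not Nutch code. The pattern is only my guess based on the ama.ab.ca examples above, strip_session_segment is a made-up name, and I have not checked whether the site still serves the page once the segment is removed.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: the session token is a path segment shaped like "SID-<hex>-<hex>",
# as in the ama.ab.ca examples. Other sites will use different patterns.
SESSION_SEGMENT = re.compile(r"/SID-[0-9A-Fa-f]+-[0-9A-Fa-f]+(?=/)")


def strip_session_segment(url):
    """Return url with any SID-style path segment removed (hypothetical helper)."""
    parts = urlsplit(url)
    cleaned_path = SESSION_SEGMENT.sub("", parts.path)
    return urlunsplit(
        (parts.scheme, parts.netloc, cleaned_path, parts.query, parts.fragment)
    )
```

URLs without such a segment, like the plentyoffish.com ones, pass through unchanged.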

Thanks,
Glenn