Posted to user@nutch.apache.org by "Insurance Squared Inc." <gc...@insurancesquared.com> on 2006/02/20 17:10:10 UTC
Excessive retries
Hi,
We're finding that one or two domains are causing excessive retries,
and that's drastically slowing our fetch process down, by hours.
Any general guidance on how to fix the problem? We've raised our max
retries setting from 1 to 3, I believe, but we're still seeing the problem.
Here are some example URLs:
http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-D426F221/ama/web/travel_Group-Travel.htm
http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-CEF90BB0/ama/web/everything_auto_driver_ed.htm
http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-DEA4DDC2/ama/web/everything_auto_Vehicle-Safety.htm
http://www.plentyoffish.com/personals/3147onlinedating.htm
http://www.plentyoffish.com/personals/1032onlinedating27.htm
http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-CBE7B5E0/ama/web/insurance_Insurance-News.htm
Also, it seems like we're trying to access hundreds of thousands of pages from some of these domains. Shouldn't the crawler be limiting the number of pages it fetches from a specific domain? (I guess that's two questions.)
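To make the second question concrete, here is a minimal Python sketch of the kind of per-host limit we have in mind. It is not Nutch code and does not reflect how Nutch builds its fetch lists; the function name cap_urls_per_host and the max_per_host parameter are made up for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cap_urls_per_host(urls, max_per_host):
    """Keep at most max_per_host URLs for each host, preserving input order.

    Hypothetical helper for illustration only; not part of Nutch.
    """
    kept = []
    seen_per_host = defaultdict(int)
    for url in urls:
        # Group by host name, ignoring case ("WWW.Example.com" == "www.example.com").
        host = urlsplit(url).netloc.lower()
        if seen_per_host[host] < max_per_host:
            seen_per_host[host] += 1
            kept.append(url)
    return kept


if __name__ == "__main__":
    candidates = [
        "http://www.plentyoffish.com/personals/3147onlinedating.htm",
        "http://www.plentyoffish.com/personals/1032onlinedating27.htm",
        "http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-CBE7B5E0/ama/web/insurance_Insurance-News.htm",
    ]
    for url in cap_urls_per_host(candidates, max_per_host=1):
        print(url)
```

With a cap like this, a single domain could not take up more than a fixed share of one fetch run, no matter how many of its URLs have been discovered.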
Offhand, it looks like there's a session variable in some of those URLs (the SID- path segment in the ama.ab.ca ones). My first guess is that it may be part of the problem. These two domains account for something like 80-90% of our retries. Clearly we need to stop the excessive retries and, at the same time, be a bit more polite to those domains.
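To illustrate the session-variable guess, here is a minimal Python sketch (not Nutch code) that strips the SID- path segment so that the same page fetched under different sessions maps to one URL. The pattern is an assumption based only on the example URLs above, the helper name strip_session_segment is made up, and we have not checked whether the site actually serves pages without the segment.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: the session token is a whole path segment of the form
# "SID-<hex>-<hex>", as in the example URLs. This is a guess from the
# samples, not a documented format.
SESSION_SEGMENT = re.compile(r"/SID-[0-9A-Fa-f]+-[0-9A-Fa-f]+(?=/)")


def strip_session_segment(url):
    """Return the URL with any SID-style path segment removed.

    Hypothetical helper for illustration only; not part of Nutch.
    """
    parts = urlsplit(url)
    cleaned_path = SESSION_SEGMENT.sub("", parts.path)
    return urlunsplit(
        (parts.scheme, parts.netloc, cleaned_path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    # Hypothetical pair: the same page under two of the session IDs seen above.
    a = "http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-D426F221/ama/web/travel_Group-Travel.htm"
    b = "http://www.ama.ab.ca/cps/rde/xchg/SID-53ED365B-CEF90BB0/ama/web/travel_Group-Travel.htm"
    # Both collapse to the same string once the segment is removed.
    print(strip_session_segment(a) == strip_session_segment(b))
```

If the guess is right, every new session ID makes an already-known page look like a brand-new URL, which would explain both the huge page counts and the repeated fetch attempts against the same host.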
Thanks,
Glenn