Posted to dev@nutch.apache.org by Jonathan Reichhold <jd...@speakeasy.net> on 2005/11/14 18:40:18 UTC

Fetcher timeout

While testing out Nutch, I've run into several hangs inside specific 
parsers, and realized that the Fetcher code has no concept of a timeout 
on a thread.  From experience doing whole-web crawls, I've found this 
to be an essential feature for long-term stability (read: hands-off 
production crawling for large indices).

As I'm coming into this codebase new: does the idea of a fetch thread 
timeout (not just an HTTP timeout) exist to guard against a bad parser?  
If so, how would I set it?  If not, and from looking at the code I 
believe it does not, is there any issue with adding it?

I saw this from Doug Cutting on nutch-general on Oct 29th, 2005:

"Also, the mapred fetcher has been changed to succeed even when threads  
hang.  Perhaps we should change the 0.7 fetcher similarly?  I think we  
should probably go even farther, and kill threads which take longer than  
a timeout to process a url.  Thread.stop() is theoretically unsafe, but  
I've used it in the past for this sort of thing and never traced  
subsequent problems back to it...  "

I agree with Doug that this is "unsafe", but I have used it on large 
sites.  At the very least, restarting the fetcher after this point (can 
this be done?) would help get through the rest of the list.
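
To make the idea concrete, here is a minimal sketch of a per-URL 
timeout.  It is not Nutch code: the class ParseWithTimeout and the 
helper runWithTimeout are hypothetical names, and it assumes a Java 
version with java.util.concurrent and lambdas.  Note that it interrupts 
the worker rather than calling Thread.stop(), so a parser that never 
checks for interruption would keep its thread busy in the background; 
the caller simply moves on to the next url.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ParseWithTimeout {

    /**
     * Hypothetical helper (not part of Nutch): runs the task on the pool
     * and waits at most timeoutMs for it. On timeout the task is
     * cancelled and the fallback value is returned instead.
     */
    static <T> T runWithTimeout(ExecutorService pool, Callable<T> task,
                                long timeoutMs, T fallback)
            throws InterruptedException, ExecutionException {
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker thread. Code that
            // ignores interruption keeps running, but the caller is free.
            future.cancel(true);
            return fallback;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            // A well-behaved "parser" that returns immediately.
            String ok = runWithTimeout(pool, () -> "parsed", 1000, "timed out");
            // A stand-in for a hung parser: sleeps far past the timeout.
            String hung = runWithTimeout(pool, () -> {
                Thread.sleep(60_000);
                return "parsed";
            }, 200, "timed out");
            System.out.println(ok + " / " + hung);
        } finally {
            // Interrupt any worker that is still running.
            pool.shutdownNow();
        }
    }
}
```

The trade-off against Thread.stop() is that a truly stuck thread is 
abandoned rather than killed, so a long crawl could still accumulate 
dead threads; restarting the fetcher would clear those.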

Jonathan Reichhold