Posted to user@nutch.apache.org by AJ Chen <ca...@gmail.com> on 2005/09/23 19:40:40 UTC

SocketTimeoutException

In my crawl of a large number of selected sites, the number of threads
is determined automatically by the number of pages on the fetchlist in
each fetch/updatedb cycle, with the maximum set to 1000. When 1000
threads are used, many SocketTimeoutExceptions occur in fetching toward
the end of the fetch cycle. Any suggestions for reducing these errors?
I also notice that SocketTimeoutExceptions are not counted in the error
count for the segment status. Why is that?

Relevant parameters set:
http.timeout=10000
http.max.delays=100000
fetcher.threads.fetch=10 to 1000, depending on the size of the fetchlist
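
For reference, here is how these properties would look in
conf/nutch-site.xml. This is a minimal sketch: the <nutch-conf> root
element assumes the Nutch 0.7-era file format, and the thread count
shown is just one point in the 10-1000 range.

<nutch-conf>
  <property>
    <name>http.timeout</name>
    <value>10000</value>   <!-- HTTP timeout, in milliseconds -->
  </property>
  <property>
    <name>http.max.delays</name>
    <value>100000</value>  <!-- how many times a thread waits on a busy host before giving up -->
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>1000</value>    <!-- varies per cycle, 10 to 1000 here -->
  </property>
</nutch-conf>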

Appreciate your help,
AJ


HD question for large DB

Posted by EM <em...@cpuedge.com>.
What would be a good hard-drive solution for a large DB? Once I get
into the range of 0.5-2 million pages, doing anything with the DB
becomes slow. I have a SATA disk with a sustained transfer rate of
about 80 MB/s.

What would it take for a DB containing 500 million pages? RAID? A RAID
of 4 drives, or of 16? Cheap drives? Large drives?
Can anyone who's been through this give me some pointers?
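
For scale, a rough back-of-envelope (the per-page DB size is only an
assumption on my part):

  500,000,000 pages * ~500 bytes/page  =  ~250 GB
  250,000 MB / 80 MB/s  =  ~3,100 s, about 52 minutes

So even a single sequential pass over the DB would take the better
part of an hour on one disk, before counting any seeks.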

Please note that I don't need to use the same system for 
indexing/searching, only for DB operations.

Regards,
EM

RE: SocketTimeoutException

Posted by Fuad Efendi <fu...@efendi.ca>.
AJ,
1000 threads sounds good, but at least with the existing J2SE from
Sun, a single JVM will perform badly at that count!

Preferable: 32 processes with 32 threads each. At least that is what
works with "The Grinder" and 2 GB of memory (I don't know about
Nutch!). You should really budget everything: CPU, and memory for
each thread. 1000 threads is too much.

P.S.
Increase "timeout", or decrease "threads".
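
For example, a sketch of that direction in conf/nutch-site.xml (the
values are illustrative only, not tested with Nutch). On the memory
point: with a typical default thread stack of around 512 KB (this
varies by JVM and platform), 1000 threads reserve roughly 500 MB for
stacks alone.

<nutch-conf>
  <property>
    <name>http.timeout</name>
    <value>30000</value>  <!-- illustrative: 30 s instead of 10 s -->
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>100</value>    <!-- illustrative: far fewer threads per process -->
  </property>
</nutch-conf>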

